ZFIN ID: ZDB-PUB-121121-1
Viral Insertion Mutants Overwrite Data
Burgess, S., and Lin, S.
Viral integration sites were amplified by linker-mediated PCR with 6-base sequence “barcodes” associated with each fish. Samples were pooled and sequenced on an Illumina HiSeq2000. Two different mapping strategies were used to improve recovery efficiency:
1) Prior to mapping the reads, a custom script was run to identify the barcodes and to trim off the 3' retroviral LTR and linker cassette (LC) sequences. The six nucleotides directly adjacent to the LC primer sequence served as the identifying barcode. The 3' LTR and/or LC sequence and the barcode were trimmed from each read, the barcode was recorded for sample identification, and the trimmed sequence was used for mapping integrations. Bowtie was used to map the trimmed retroviral sequence tags to the Zv9 zebrafish genome assembly, allowing for one mismatch. The integration site was defined as the site of LTR insertion.
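The barcode extraction and trimming step can be sketched as follows. This is only an illustration, not the custom script used in the study: the `LC_PRIMER` sequence is a made-up placeholder (the real linker cassette primer is not given here), and the sketch assumes the barcode lies immediately 3' of the primer within the read.

```python
from typing import Optional, Tuple

# Hypothetical placeholder; the actual LC primer sequence is not stated in the text.
LC_PRIMER = "GTAATACGACTCACTATAGGGC"
BARCODE_LEN = 6  # six-nucleotide barcode, as described above


def trim_linker_read(read: str, lc_primer: str = LC_PRIMER) -> Optional[Tuple[str, str]]:
    """Return (barcode, trimmed_sequence), or None if no usable barcode is found.

    Assumes the barcode sits directly after the LC primer in the read.
    """
    pos = read.find(lc_primer)
    if pos == -1:
        return None  # primer not present in this read
    barcode_start = pos + len(lc_primer)
    barcode_end = barcode_start + BARCODE_LEN
    if len(read) < barcode_end:
        return None  # read too short to contain a full barcode
    barcode = read[barcode_start:barcode_end]
    # Everything after the barcode is kept as the sequence tag to map.
    return barcode, read[barcode_end:]
```

The trimmed sequence returned here would then be passed to the aligner, with the barcode carried along as the sample label.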
After mapping, corresponding ends were paired and uniquely mapping read pairs were used to identify insertion sites. If a single mate of a pair contained both the 3' LTR and the barcode, that single sequence, if mapped to only one position in the genome, was used to map the integration site. Reads were defined as redundant and subsequently collapsed if they aligned to the same chromosome at the same start position (the genomic location of the integration), occurred on the same DNA strand and had the same barcode. The number of redundant sequences was recorded. Integration sites with ≥60 redundant sequence reads were used for downstream analyses.
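A minimal sketch of the collapsing rule described above, assuming each mapped read has already been reduced to its chromosome, start position, strand and barcode. The `Hit` type and the `min_reads` parameter are names introduced for illustration; the read-count cutoff is left as an argument rather than hard-coded.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    chrom: str
    start: int    # genomic location of the integration
    strand: str   # "+" or "-"
    barcode: str


def collapse_hits(hits: Iterable[Hit], min_reads: int = 1) -> Dict[Hit, int]:
    """Collapse reads sharing chromosome, start, strand and barcode.

    Returns a mapping from each distinct site to its redundant read count,
    keeping only sites supported by at least `min_reads` reads.
    """
    counts = Counter(hits)
    return {site: n for site, n in counts.items() if n >= min_reads}
```

Because `Hit` is a hashable tuple, identical sites collapse automatically and the counter value is the recorded number of redundant sequences.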
The genomic positions of retroviral integrations were compared to those of zebrafish gene models obtained from Ensembl Zv9 e65. A custom Perl script was run to identify those retroviral insertions that occurred within a gene or within 1 kb upstream or downstream of a gene.
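The gene-proximity test amounts to an interval check with a 1 kb flank on either side of each gene. The sketch below is written in Python rather than Perl and is not the original script; the `Gene` type and function name are hypothetical, and gene coordinates are assumed to be inclusive.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def genes_near(chrom: str, pos: int, genes: List[Gene], flank: int = 1000) -> List[str]:
    """Names of genes that contain `pos` or lie within `flank` bp of it."""
    return [
        g.name
        for g in genes
        if g.chrom == chrom and g.start - flank <= pos <= g.end + flank
    ]
```

A linear scan like this is adequate for illustration; for genome-scale data an interval index or a sorted sweep would be the usual choice.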
2) We assembled a curated, single-ended library from the original paired-end reads. A brute-force exact alignment algorithm was used to align the paired reads along their overlapping regions and to find the location of both the linker and LTR sequences. Flanking sequences were extracted and aligned to the zebrafish genome assembly Zv9 using Bowtie with a tolerance of one mismatch. Only reads longer than 11 nt and with unambiguous alignments were used to pinpoint the insertion locus.
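One way to picture a brute-force exact overlap alignment is to reverse-complement the second mate and try every overlap length from longest to shortest until the end of the first read matches the start of the second exactly. The function names and the `min_overlap` default below are assumptions for illustration, not details of the algorithm used in the study.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10) -> Optional[str]:
    """Merge two mates into one single-ended sequence via exact overlap.

    read2 is reverse-complemented, then overlaps are tested from longest to
    shortest. Returns None if no exact overlap of at least `min_overlap`
    bases exists.
    """
    mate = revcomp(read2)
    for k in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-k:] == mate[:k]:
            return read1 + mate[k:]
    return None
```

For example, two 16-base mates drawn from opposite ends of a 24-base fragment share an 8-base overlap, so `merge_pair(r1, r2, min_overlap=8)` reconstructs the full fragment.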
Some flanking sequences were sufficiently long that the paired reads did not overlap. In these cases, an oriented pseudo single-end sequence was generated. The resulting flanking sequences were separately mapped to Zv9 using Bowtie. Multiple hits were filtered to keep a maximum of 20 hits per read. In our model, the paired flanking sequences have a unique alignment if hits from both sequences align to the same chromosomal region in the same strand orientation, the hits are less than 1 kb apart, and only one hit pair meets these requirements.
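The uniqueness rule for non-overlapping pairs can be sketched as a filter over all combinations of hits from the two flanking sequences. The `Hit` type and function name are hypothetical, and the sketch assumes each hit list is already ordered so that truncating it to 20 entries reflects the intended filter.

```python
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    chrom: str
    pos: int
    strand: str


def unique_hit_pair(
    hits_a: List[Hit],
    hits_b: List[Hit],
    max_dist: int = 1000,
    max_hits: int = 20,
) -> Optional[Tuple[Hit, Hit]]:
    """Return the single concordant hit pair, or None if there are zero or several.

    A pair is concordant when both hits are on the same chromosome and strand
    and lie less than `max_dist` bp apart.
    """
    pairs = [
        (a, b)
        for a in hits_a[:max_hits]
        for b in hits_b[:max_hits]
        if a.chrom == b.chrom
        and a.strand == b.strand
        and abs(a.pos - b.pos) < max_dist
    ]
    return pairs[0] if len(pairs) == 1 else None
```

Returning `None` for both the zero-pair and the multi-pair case keeps ambiguous reads out of the downstream integration set.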
All unique hits from the pre-processing step were pooled, and integration coordinates were extracted from the Bowtie mapping output. The integration site was defined as the genomic coordinate immediately adjacent to the portion of the read to which the 3' LTR had been attached. The same integration event is typically sequenced multiple times in the library. Since multiple restriction enzymes were used to digest the genomic DNA in this study, it is possible that the same integration DNA fragment was generated at different lengths and would vary at the 5' linker-barcode end. Therefore the 3' LTR end position was used to determine redundant sequences for each barcode or sample, and the longest fragment was used as the representative to report and display the insertion locus.
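The redundancy rule above, grouping by the 3' LTR end position per barcode and reporting the longest fragment, might look like the following. The `Fragment` fields and the function name are illustrative assumptions; strand is included in the grouping key here as an assumption as well.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Fragment(NamedTuple):
    chrom: str
    ltr_pos: int   # coordinate adjacent to the 3' LTR (the integration site)
    strand: str
    barcode: str
    length: int    # length of the mapped flanking sequence


Key = Tuple[str, int, str, str]


def pick_representatives(frags: Iterable[Fragment]) -> Dict[Key, Tuple[Fragment, int]]:
    """Group fragments by LTR-end position and barcode.

    For each group, return the longest fragment together with the number of
    fragments that were collapsed into it.
    """
    result: Dict[Key, Tuple[Fragment, int]] = {}
    for f in frags:
        key = (f.chrom, f.ltr_pos, f.strand, f.barcode)
        if key not in result:
            result[key] = (f, 1)
        else:
            best, count = result[key]
            result[key] = (f if f.length > best.length else best, count + 1)
    return result
```

The count attached to each representative corresponds to the per-site read frequency.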
For reporting and viewing the data, a BED-formatted file was produced, which gives the chromosome, flanking sequence start, flanking sequence end, barcode, frequency (number of reads per integration site) and orientation of each integration. The file is available as a download from http://research.nhgri.nih.gov/ZInC/. The gene annotation file Ensembl Zv9 e65 was used to build a “One gene, One transcript” gene structure model (see Supplementary Figure 2) as the exonic union of all the annotated transcripts. Finally, bedtools was used to determine the overlap between the integration and the gene model, and integration sites were annotated as explained above.
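The “exonic union” idea can be illustrated as merging the exon intervals of all transcripts of a gene into one non-overlapping set. This is a sketch only; it assumes half-open, BED-style coordinates and treats directly adjacent exons as one block, which may differ from how the actual model was built.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open as in BED


def exonic_union(exons: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent exon intervals from all transcripts of a gene."""
    merged: List[List[int]] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous block: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]
```

Applied to the pooled exons of every annotated transcript of a gene, this yields a single transcript-like structure against which integration coordinates can be intersected.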