FIGURE SUMMARY
Title

An improved zebrafish transcriptome annotation for sensitive and comprehensive detection of cell type-specific genes

Authors
Lawson, N.D., Li, R., Shin, M., Grosse, A., Yukselen, O., Stone, O.A., Kucukural, A., Zhu, L.
Source
Full text @ Elife

(A, B) Volcano plots showing differentially expressed genes from Tg(kdrl:HRAS-mCherry)s896-positive and negative (kdrlpos and kdrlneg) cells identified using RNA-seq reads quantified with (A) RefSeq or (B) Ensembl, version 95 (Ens95) transcript annotations. Genes with significant enrichment (padj <0.05) are shown as red or blue (log2 fold change >1 or <-1, respectively) across replicate samples (n = 3). Grey dots are genes that fall below statistical cutoffs. Green dots indicate selected genes previously found to be enriched in endothelial cells in zebrafish or other models. (C) Venn diagram illustrating the intersection of genes with a common NCBI ID in Ens95 and RefSeq found significantly enriched in kdrlpos cells using either annotation. (D, E) Plots of commonly annotated genes identified as kdrlpos-enriched only by Ens95 with indicated values from (D) Ens95 or (E) RefSeq. (F, G) Correlation of expression levels from indicated annotation for kdrlpos-enriched genes identified (F) selectively as such by Ens95 (maroon) or RefSeq (grey) only, or (G) both annotations. Data are not normally distributed, Spearman correlation, r values are indicated.

(A) Log10 average expression (n = 3) for kdrlpos-enriched genes as quantified by each indicated annotation. Each separate plot shows genes identified as kdrlpos-enriched only using or Ens95 or RefSeq. Data are not normally distributed, Wilcoxon matched-pairs signed-rank test, p values are indicated. (B) Venn diagram of intersection for genes with a common NCBI ID in Ens95 and RefSeq found significantly enriched in pdgfrbpos cells (log2 fold change [pdgfrbpos/pdgfrbneg]>1,adjp <0.05) using each annotation. (C, D) Volcano plots showing differentially expressed genes from TgBAC(pdgfrb:citrine)s1010-positive and negative cells identified using RNA-seq reads quantified using (C) RefSeq or (D) Ens95 annotations. Genes with significant differences (padj <0.05) are shown as red (log2 fold change >1) or blue (log2 fold change<-1). Grey dots are genes that fall below these statistical cutoffs; n = 3. Green dots indicate selected genes previously identified as expressed in vascular smooth muscle cells or pericytes (see main text). (E) Plots of commonly annotated genes identified as pdgfrbpos-enriched only by Ens95 with indicated values from Ens95 or RefSeq. (F) Correlation of expression levels from indicated annotation for pdgfrbpos-enriched genes identified selectively as such by Ens95 (maroon) or RefSeq (grey) only (left plot) or in both annotations (right plot). Data are not normally distributed, Spearman correlation, r values are indicated. (G) Log10 average expression (n = 3) for pdgfrbpos-enriched genes as quantified by each indicated annotation. Each separate plot shows genes only identified as pdgfrbpos-enriched using Ens95 or RefSeq. Data are not normally distributed, Wilcoxon matched-pairs signed-rank test, p values are indicated.

(A, B) Plots showing 3' UTR length from matched reference genes from indicated annotation identified as enriched only in Ens95 or RefSeq. Mean 3' UTR length for each group is shown, error bars denote mean and standard deviation. No statistical comparison is presented since these data are already defined as longer (>50 nt) in indicated annotation. (A) kdrlpos-enriched genes. (B) pdgfrbpos-enriched genes.

Analysis of RNA-seq reads from a random-primed library.

(A,B) Volcano plots of differentially expressed genes from Nr2f2-positive and -negative (Nr2f2pos and Nr2f2neg) endothelial cells identified using RNA-seq reads quantified with (A) RefSeq or (B) Ensembl, version 95 (Ens95) transcript annotations. Genes with significant enrichment (padj <0.05) are shown as red or blue (log2 fold change >1 or <-1, respectively). Grey dots are genes that fall below statistical cutoffs. (A, B) Green dots are selected known vein-specific genes. (C) Venn diagram of genes with a common NCBI ID in Ens95 and RefSeq identified as significantly Nr2f2pos-enriched using either annotation. (D) Correlation of log10 average expression levels (n = 3) from indicated annotation for Nr2f2pos-enriched genes identified selectively as such by Ens95 or RefSeq only (left plot) or both annotations (right plot). Data are not normally distributed, Spearman correlation, r values are indicated. (E) Log10 average expression (n = 3) for Nr2f2pos-enriched genes as quantified by each indicated annotation. Separate plots shown for genes selectively identified as Nr2f2pos-enriched using Ens95 or RefSeq. Data are not normally distributed, Wilcoxon matched-pairs signed-rank test, p values are indicated. (F) Plots of commonly annotated genes identified as Nr2f2pos-enriched only by Ens95 with indicated values from Ens95 (left plot) or RefSeq (right plot). (G) Pie charts showing the proportion of reference genes identified as Nr2f2pos-enriched by Ens95 and RefSeq with indicated relative 3' UTR length. (H) Correlation of log10 average expression from Nr2f2pos RNA-seq (n = 3) quantified with each annotation for matched reference genes with same 3' UTR length or longer Ens95 or RefSeq 3' UTR length. Data are not normally distributed, Spearman correlation, r values are indicated. (I) Log10 average expression (n = 3) as quantified by Ens95 or RefSeq for Nr2f2pos-enriched genes lacking an Ens95 3' UTR annotation. Data are not normally distributed, Wilcoxon matched-pairs signed-rank test, p value is indicated. (J) UCSC browser image of the erg gene (on minus strand) showing 3' UTR annotation and lack thereof from RefSeq and Ens95, respectively. Mapped depth of RNA-seq reads from Nr2f2pos cells captured for each annotation is indicated. Genome-mapped consolidated reads from GSE32900 are also shown as is the location of 3P-seq feature, also on the minus strand.

(A, B) Log10 average expression as quantified using indicated annotation for (A) kdrlpos- or (B) pdgfrbpos-enriched genes identified as such only in RefSeq and lacking an Ens95 3' UTR annotation. Expression levels for genes from each annotation with matched NCBI ID are shown in each case. Data are normally distributed (Shapiro-Wilks test), paired t-test, p values are indicated; n = 3 (i.e. each point represents an average value from three separate RNA-seq replicates). (C) UCSC browser image of slc7a5 locus on the minus strand showing 3' UTR annotations from Ens95 and RefSeq. Mapped read depth from kdrlpos cells on the genome, or assigned to each annotation are indicated, as is a 3P-seq feature. The GSE32900 track is consolidated RNA-seq reads from all stages indicated in Figure 3A. The location of a putative missing 3' UTR is indicated. (D) Pie chart showing numbers of reference genes with the same or longer 3' UTRs in each indicated annotation. (E) Pie charts showing the proportion of reference genes selectively identified as kdrlpos- or pdgfrbpos-enriched by Ens95 and RefSeq with indicated relative 3' UTR length. (F, G) Correlation plots showing log10 average expression from kdrlpos RNA-seq (n = 3) quantified with each annotation for matched reference genes with (F) longer Ens95 (maroon) or RefSeq (light blue) 3' UTR, or (G) same 3' UTR length. Data are not normally distributed, Spearman correlation, r values are indicated. (H, I) UCSC browser images of (H) sox17 and (I) cspg4 loci, both on the minus strand, showing 3' UTR annotations from Ens95 and RefSeq. Mapped read depth of RNA-seq from (H) kdrlpos or (I) pdgfrbpos cells captured for each annotation is indicated. Consolidated reads from GSE32900 and location of 3P-Seq features are indicated, as is putative missing 3' UTR in cspg4.

(A) Schematic outline for generating a new zebrafish transcriptome annotation. See Results and Materials and methods sections for details. (B) Pie charts showing the proportion of reference genes with same, longer or shorter 3' UTR in the V4.3 annotation compared to relative 3' UTR length between Ens95 and RefSeq. (C, E) Venn diagrams showing intersection of reference genes with commonly annotated NCBI ID that are significantly enriched in (C) kdrlpos- or (E) pdgfrbpos-cells in each indicated annotation. (D, F) Volcano plots of reference genes with common NCBI ID identified as (D) kdrlpos- or (F) pdgfrbpos-enriched only by Ens95 in comparison to RefSeq. Indicated values are from the same genes quantified using V4.3. Red dots indicate log2 fold change >1 and adjp <0.05. (G) 3' UTR lengths and (H) log10 average expression (n = 3) using RefSeq and V4.3 for reference genes in the indicated dataset. (G, H) Data are not normally distributed, Wilcoxon matched-pairs signed-rank test, p values are indicated. Error bars denote mean and standard deviation. (I) Log10 average expression (n = 3) and (J) 3' UTR lengths across all annotations for reference genes uniquely identified as enriched in indicated transgene-positive cell type using V4.3 (log2 fold change >1, padj <0.05). Data are not normally distributed. Friedman test to assess variance (p<0.0001 in all cases). Dunn's multiple comparison test was used for pairwise comparisons, p values are indicated. Error bars denote mean and standard deviation.

Ensembl naming conflicts and improved transcript diversity in V4.3.

(A, B) Annotated UCSC Genome Browser screenshots of (A) cenpq and mrpl39 loci and (B) talgn3b and abhd10b loci. RefSeq, Ens95, Ens99 and V4.3 transcript annotations are shown. Ensembl-annotated transcripts overlapping adjacent locus and leading to misassignment of gene names are indicated. (C) Apparent detected expression for abhd10b and tagln3b in pdgfrbpos RNA-seq reads (n = 3) using indicated annotation. Error bars denote SEM. The abhd10b transcripts are annotated as belonging to the tagln3b gene in Ensembl95 and 99. This leads to a failure to detect abhd10b when using Ensembl annotations, along with spurious inflation of tagln3b expression levels due to the inclusion of reads that should be mapped to abhd10b. (D) Histogram plot of numbers of transcripts per gene for indicated annotation. The plot is limited to genes with 10 or fewer transcripts. (E) UCSC browser image of the erg locus showing transcripts from indicated annotation and mapped reads from GSE32900.

The V4.3 annotation improves the detection of cell-type-specific genes from bulk RNA-seq data.

(A, B) Volcano plots of RNA-seq data from (A) kdrl-positive and negative and (B) pdgfrb-positive and negative cells quantified using V4.3. (A, B) Numbers of differentially expressed genes, along with selected known (A) endothelial or (B) mural cell genes are indicated with green dots. (A, B) Genes with significant differences (padj <0.05) are shown as red (log2 fold change pos/neg > 1) or blue (log2 fold changepos/neg <-1). (C–E) Left panels, UCSC browser images of (C) slc7a5 (minus strand), (D) slc2a1a (plus strand), and (E) cspg4 (minus strand) loci showing 3' UTR annotations from V4.3, ENS95, and RefSeq. Mapped depth of RNA-seq reads from indicated cell type on the genome, or assigned to each annotation are indicated, as are 3P-seq features. (C–E) Right panels, log10 normalized expression of (C) slc7a5, (D) slc2a1a, and (E) cspg4 in replicate RNA-seq samples (n = 3) quantified using indicated annotations. Values are normalized with median ratio normalization. (C, D) Values display normal distribution (Shapiro-Wilks test) and analysis of variance revealed statistical significance (slc2a1a, p=0.0160; slc7a5, p=0.0002). Adjusted p-values from Dunnett's multiple comparison tests are indicated. (E) Values are not normally distributed, variance determined by Friedman test (p=0.0278). p-values from Dunn's multiple comparison test are indicated. Error bars represent mean and standard deviation.

Analysis of Nr2f2<sup>pos</sup> and Nr2f2<sup>neg</sup> datasets using V4.3.

(A) Volcano plot of RNA-seq data from Nr2f2pos and Nr2f2neg cells quantified using V4.3. Numbers of differentially expressed genes are shown. Selected known venous endothelial genes are indicated with green dots, as is erg, which is not detected as differentially expressed by Ens95. Genes with significant differences (padj <0.05) are red (log2 fold change pos/neg > 1) or blue (Nr2f2pos-enriched, log2 fold change pos/neg <-1). (B) Venn diagram showing intersection of reference genes (using commonly annotated NCBI ID) that are significantly enriched in Nr2f2pos-cells in each indicated annotation. (C) Volcano plot of Nr2f2pos-enriched reference genes identified only by Ens95 in comparison to RefSeq. Indicated values are from the same genes quantified using V4.3. Red dots indicate log2 fold change<-1 and adjp <0.05. (D) 3' UTR lengths and (E) log10 average expression (n = 3) for reference genes in Ens95 and V4.3 for the indicated dataset. (D, E) Data are normally distributed, paired t-test, p values are indicated. Error bars denote mean and standard deviation.

(A, B) tSNE plots of cells from 5 day post fertilization (dpf) zebrafish embryos from the same mapped scRNA-seq reads quantified with (A) Ens95 or (B) V4.2. The total number of clusters and cells that passed filtering are shown. Clusters of interest noted in the text are named. Cartilage and epidermis clusters are circled. (C, D). tSNE plots showing restricted expression of (C) mia in cartilate cells and (D) and1 in epidermis cells for both Ens95 and V4.2 annotations. Each cluster is indicated by a circle. Legends indicate log-transformed and normalized expression values.

Standard variance and principal component comparisons for scRNA-seq analysis.

(A) Plots of standard deviation against the number of principal components using whole embryo scRNA-seq data quantified using indicated annotation. Standard deviation values for 75 principal components are shown in both cases. (B) Plots of cumulative variation against the number of principal components from scRNA-seq data quantified using indicated annotation. Total cumulative variance values for 75 principal components are shown in each case. (C) Histogram plot of the number of genes per cluster. The bin size is 50. No clusters have 0 genes. (D) Histogram showing the number of cells per cluster. The bin size is 100. No clusters have 0 cells.

Identification of cartilage and epidermis cell clusters in scRNA-seq data.

(A) tSNE plots showing expression of cartilage markers mia, matn1, col2a1a, and fgfbp2 using clustering based on data quantified with Ens95 or V4.2, as indicated. (B) tSNE plots showing expression of epidermis markers and1, and2, and msx2b using clustering based on data quantified with Ens95 or V4.2, as indicated. (A, B) Legends indicate log-transformed normalized expression level per cell.

(A) Venn diagrams illustrating intersection by common LL gene ID of genes enriched in mia-positive cartilage cells and and1-positive epidermis cells by both indicated annotations. (B) 3' UTR lengths from Ens95 and V4.2 for reference genes identified as cartilage or epidermis-specific uniquely by V4.2. Data are not normally distributed (Shapiro-Wilks test), comparison by Wilcoxon matched-pairs signed-rank test, p-values are shown. (C) Volcano plots showing cartilage- or epidermis-specific genes identified selectively by V4.2 with corresponding values from bulk RNA-seq comparison of pdgfrbpos and pdgfrbneg cells. Red indicates log2 fold change >1 (pdgfrbpos/pdgfrbneg), padj <0.05. (D, E) tSNE plots of the mia-positive cartilage cluster showing expression levels of sox9a and fzd9b using scRNA-seq data quantified using (D) Ens95 and (E) V4.2. Legends indicate log-transformed and normalized expression levels per cell. (F) UCSC browser image of sox9a and fzd9b loci showing 3' UTR annotations from V4.2 and Ens95. Both loci are located on the negative strand. Mapped read depth from kdrlpos cells on the genome, or assigned to each annotation are indicated, as are 3P-seq features, all of which reside in the same orientation as the genes themselves.

Improved detection of cartilage and epidermis genes in scRNA-seq data using V4.2.

(A, B) Left, plots showing values from bulk RNA-seq comparison of pdgfrbpos and pdgfrbneg cells quantified with Ens95 for (A) cartilage and (B) epidermis genes identified as such selectively in V4.2 by scRNA-seq. Red indicates log2 fold change >1, padj <0.05. (A, B) Venn diagrams showing the intersection of (A) cartilage and (B) epidermis genes identified as pdgfrbpos-enriched in bulk RNA-seq using indicated annotation. (C) Examples of epidermis genes identified by scRNA-seq using V4.2, but not Ens95. Left, tSNE plots of epidermis cluster from data quantified with Ens95 or V4.2 showing expression of fbln1, mmp11a, and ptx3a. Right, 3' UTR annotation for each gene from Ens95 or V4.2. Location and read density of scRNA-seq reads are indicated.

(A) tSNE plot of all clusters from Ensembl (Ens95)-quantified scRNA-seq of zebrafish embryos at 5 days post fertilization with one erythroid and two leukocyte clusters indicated. (A–E, G) Circled clusters denote sla2pos leukocyte cells. Asterisk marks cells spuriously assigned to sla2pos leukocytes, but identified as erythroid two when using the V4.2 annotation. (B–E) tSNE plots showing only sla2pos leukocyte (circled) and erythroid clusters with expression of (B) known leukocyte markers, coro1a, and rac2, (C) erythroid marker hbae1.1, and (D) sla2. (E) tSNE plots showing expression of pcna and myb in sla2pos and erythrocyte clusters. (B–E) Legends indicate log-transformed and normalized expression values. (F) Erythroid and sla2pos (leukocyte) clusters identified using the same scRNA-seq dataset as in (A) but quantified using V4.2. A unique erythroid-like cluster (erythroid 2) is circled. (G) tSNE plot of erythroid and sla2pos leukocyte clusters identified with Ens95 quantified data. Dark green denotes cells with identical barcodes as those in the erythroid two cluster identified using V4.2 shown in (F). (H) Three-way Venn diagram intersecting cluster-specific genes of indicated clusters by gene symbol. Selected genes are shown.

Improved detection of erythroid markers in V4.2.

(A) tSNE plots of blood cell clusters from Ens95- and V4.2-quantified data showing expression of slc25a37, tfr1a, and itga4. Erythroid 2 cells spuriously identified as leukocyte using Ens95 quantification are circled. Legends indicate log-transformed and normalized expression values. (B) 3' UTR annotations from Ens95 and V4.2 for slc25a37 (on the positive strand), tfra1a (on the negative strand), and itga4 (positive strand). The 3' end of the cerkl gene on the negative strand overlaps the itga4 3' UTR. Therefore, in this case we show read densities split based on strand.

Acknowledgments
This image is the copyrighted work of the attributed author or publisher, and ZFIN has permission only to display this image to its users. Additional permissions should be obtained from the applicable author or publisher of the image. Full text @ Elife