To reconstruct an Iso-Seq non-redundant transcriptome (ISNT) that could be tested as a reference for expression profiling, CISIs were clustered with Evidential Gene to reduce the redundancy between stages and libraries. One representative for each expressed transcript was retained, for a total of 28,721 isoforms. Fewer than four percent of these isoforms did not contain a complete open reading frame, likely due to residual errors in their sequence. Mapping on the Cabernet Sauvignon genome showed that they potentially derive from 24,378 gene loci across both primary contigs and haplotigs. Most isoforms aligned to both a locus on a primary contig and a locus on its associated haplotig, suggesting that, despite the differences between haplotypes, the mapping could not differentiate between alleles. For 10,727 loci, multiple isoforms mapped to the same locus, possibly due to alternative splicing and structural differences between alleles. ISNT included 77.6% of the BUSCO conserved orthologs; while far from representing the complete grape transcriptome, the ISNT dataset included a remarkably high fraction of these conserved genes, particularly considering that ISNT was constructed using expression data from berries only. Interestingly, 301 BUSCO complete gene models were found in multiple copies in the ISNT, suggesting that alternative isoforms of these highly conserved genes are expressed during ripening. Putative functions were assigned to the ISNT transcripts as described in the Methods section.
Gene Ontology terms were assigned to 23,386 transcripts with an average of 6.9 GO terms per transcript.

Previous analyses of gene content in a limited number of grape cultivars showed that up to 10% of grape isoforms were not shared between genotypes. Some of these “dispensable” genes were associated with cultivar-specific characteristics. To identify protein-coding transcripts characteristic of Cabernet Sauvignon, we looked for homologous sequences among the ISNT transcripts in the PN40024 genome and in the transcriptomes of Corvina, Tannat, and Nebbiolo. Approximately five percent of the ISNT did not have a homologous copy in any of the four datasets. These putative Cabernet Sauvignon private isoforms were involved in various biological processes of berry development and ripening, such as phenylpropanoid/flavonoid biosynthesis, sugar accumulation and transport, water transport, and cell wall metabolism and loosening.

To evaluate the non-redundant Iso-Seq transcriptome’s completeness and usefulness as a reference for RNA-seq analysis, the protein-coding genes in the Cabernet Sauvignon genome were predicted as described in Figure S3. First, the repetitive regions of the genome were masked using a custom-made library of Cabernet Sauvignon MITE, LTR, and TRIM information. Overall, 51% of the assembly consisted of repetitive elements, with 412,994 repetitive elements on the primary assembly and 274,123 on the haplotigs. LTRs were the most abundant class, covering over 335 Mb of the genomic sequences, with Gypsy and Copia families accounting for 201 Mb and 104 Mb, respectively. Next, MAKER-P identified putative protein-coding loci, combining the results of six ab initio predictors trained ad hoc with publicly available experimental evidence. Ab initio predictors were trained using a custom set of 4,000 randomly selected gene models out of the 5,636 high-quality, non-redundant, and highly conserved gene models of the PN40024 V1 transcriptome.
Experimental evidence from public databases was incorporated and used to validate the predicted models. The final MAKER-P prediction included 38,227 high-quality gene models on the primary contigs and 26,789 on the associated haplotigs. Using the covariance models from the Rfam database, 5,780 non-overlapping putative long non-coding RNAs belonging to 275 different families were annotated. Gene models were further improved using the information from all Iso-Seq full-length datasets, RNA-seq, and the publicly available grapevine transcriptome assemblies. This final refinement improved the annotation of the UTRs and added isoform information. PCIRs helped identify 155 new loci not detected by MAKER-P, update the structure of 10,801 gene models, and add 2,712 alternative transcripts. C-FLNC reads introduced 830 additional missing loci and added 3,738 alternative transcripts to the annotation. Together, 14,388 gene models were updated. FLNC reads introduced 14 new loci and 20,493 alternative transcripts, bringing the number of updated model structures to 24,945. Predicted genes without similarity to known proteins in the RefSeq database and without any functional domains identified by InterProScan were removed. The final predicted transcriptome included 55,887 transcripts on 36,689 loci on primary contigs and 40,444 transcripts on 25,479 loci on haplotigs. GO terms were assigned to 80,752 transcripts based on homology with protein domains in RefSeq and InterPro databases.

The incorporation of Iso-Seq data in the gene prediction pipeline also allowed the structural annotation of alternative transcripts. Twenty-five percent of the 62,168 annotated gene loci had two or more alternative isoforms, with an average of 1.55 ± 1.29 alternative transcripts per locus, confirming previous reports in PN40024. The frequencies of splicing variant types were similar to those observed in other plant species.
Intron retention was the most abundant type, accounting for over 44%, similar to what was observed for rice, Arabidopsis, and maize.
Alternative acceptor sites, alternative donor sites, and exon skipping were the other most abundant types of alternative splicing found in the Cabernet Sauvignon genome; a full description of the selected splicing events is reported in File S12.

The final step of the analysis was to evaluate the effectiveness of the reconstructed ISNT as a reference for RNA-seq analysis of berry development compared to the gene space predicted on the Cabernet Sauvignon genome. Comparisons between the predicted transcripts and the reconstructed ISNT as references for RNA-seq are summarized in Table 1. Only about three percent more RNA-seq reads mapped on the Cabernet Sauvignon predicted transcriptome than on the ISNT, suggesting that Iso-Seq reconstructed most of the transcripts detectable by RNA-seq at a coverage of 26 M reads/sample. Approximately 75% of the ISNT and 49% of the predicted gene space were detected as expressed in at least one stage. In both datasets, the number of expressed genes was slightly higher at the pre-véraison stage than at later developmental stages, consistent with previous observations of ripening Cabernet Sauvignon berries. For both datasets, Pearson’s correlation matrix and Principal Component Analysis showed a clear distinction between the pre-véraison stage and the three ripening stages, as well as a stronger correlation between post-véraison and full-ripe berry transcriptomes. This confirms the well-known transcriptional reprogramming associated with the onset of ripening and suggests that similar global transcriptomic dynamics of berry development can be obtained using either Iso-Seq or the whole genome as reference. We then applied a sequence clustering approach to define associations between ISNT isoforms and gene loci to directly compare the expression values of each gene in the two transcriptomes. Based on reciprocal overlap of the alignment, we were able to associate 25,306 ISNT transcripts with 26,873 gene loci in the Cabernet Sauvignon genome.
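As an illustration of what a reciprocal-overlap criterion can look like, the following minimal Python sketch pairs an aligned transcript with a gene locus only when the shared span covers a minimum fraction of both features. The `Interval` type, the function names, the 0-based half-open coordinates, and the 0.5 threshold are assumptions made for this example; they are not the parameters or the software used in this study.

```python
from typing import Dict, List, NamedTuple, Tuple


class Interval(NamedTuple):
    """Hypothetical genomic interval (0-based start, exclusive end)."""
    contig: str
    start: int
    end: int
    strand: str


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a.contig != b.contig or a.strand != b.strand:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    # The overlap must be large relative to BOTH features to count.
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def associate(
    transcripts: Dict[str, Interval],
    loci: Dict[str, Interval],
    min_fraction: float = 0.5,  # assumed threshold, for illustration only
) -> List[Tuple[str, str]]:
    """List (transcript_id, locus_id) pairs passing the reciprocal threshold."""
    pairs = []
    for t_id, t in transcripts.items():
        for g_id, g in loci.items():
            if reciprocal_overlap(t, g) >= min_fraction:
                pairs.append((t_id, g_id))
    return pairs
```

The all-against-all loop keeps the sketch short; a real comparison over tens of thousands of features would normally sort the intervals or use an interval index. Because a transcript can pass the threshold for more than one locus (for example, a primary contig and its haplotig), the number of associated loci can exceed the number of associated transcripts.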
Gene expression levels measured on the two transcriptomes were well-correlated. Differential gene expression analysis identified 14,477 ISNT transcripts and 18,600 Cabernet Sauvignon genes significantly differentially expressed at least once during berry development. More genes were differentially regulated between pre-véraison and véraison than during ripening for both transcriptomes, as previously observed. Ninety-one percent of the differentially expressed ISNT isoforms were also differentially expressed when RNA-seq data were mapped on genomic loci. Similar relative amounts of Biological Process GO terms among differentially expressed genes were observed between the two transcriptomes. Interestingly, 302 Cabernet Sauvignon private isoforms were differentially expressed during berry development, including transcripts encoding a polyol transporter, an inositol transporter, and five aquaporins.

Full-length cDNA sequencing with SMRT technology can be used to rapidly reconstruct the grape berry transcriptome, enabling the identification of cultivar-specific isoforms, refinement of the Cabernet Sauvignon genome annotation, and the creation of a reference for transcriptome-wide expression profiling. In contrast to transcriptome reconstruction using short-read sequencing, which requires de novo assembly, Iso-Seq delivers full-length transcripts that eliminate the introduction of assembly errors and artifacts like chimeric transcripts and incomplete fragments due to PolyA capture. The incorporation of high-coverage short-read sequencing is still necessary to benefit from the complete transcript sequencing enabled by Iso-Seq. Although Iso-Seq provides much longer reads than second-generation sequencing platforms and as a result is excellent in resolving transcript structure, its sequencing error rate is high and its throughput is still relatively low.
Here we show that combining Iso-Seq with Illumina sequencing at high coverage enables expression profiling and sequence error correction of Iso-Seq reads, particularly those derived from low-expression genes.
The clustering step of the SMRT Link pipeline discarded 18.5% of the FLNC reads, likely because of low sequence accuracy. To overcome this technical issue, we applied a hybrid error correction pipeline that corrects the unclustered FLNC reads and then applies an additional clustering step to both sets to resolve redundancies. Error correction with Illumina reads recovered a substantial number of Iso-Seq reads that would otherwise have been removed by the standard Iso-Seq pipeline, highlighting the importance of integrating multiple sequencing technologies with complementary features. Transcriptome reconstruction has been widely used to develop references for genome-wide expression profiling in the absence of an annotated genome assembly. Though a genome reference is available for grape, transcriptome reconstruction overcomes the limitations of a cultivar-specific reference that lacks the gene content of other cultivars. Although cultivar-specific genes appear nonessential for berry development, those private genes could contribute to cultivar characteristics. For example, the wine grape Tannat accumulates unusually high levels of polyphenols in the berry; its cultivar-specific genes account for more than 80% of the expression of phenolic and polyphenolic compound biosynthetic enzymes. De novo transcriptome assembly from short RNA-seq reads has been used to explore the gene content diversity in Tannat, Corvina, and Nebbiolo. Iso-Seq identified 1,501 Cabernet Sauvignon transcripts expressed during berry development that were found in neither the genome of PN40024 nor the transcriptomes of Tannat, Nebbiolo, and Corvina. Some private Cabernet Sauvignon transcripts have functions potentially associated with traits characteristic of Cabernet Sauvignon grapes and wines, such as their color and sugar content.
These transcripts included three sugar transporter-coding genes, which could be involved in the accumulation of glucose and fructose during berry ripening, and a chalcone synthase, a flavanone 3-hydroxylase, and a flavonoid 3'-hydroxylase, all involved in the flavonoid pathway. Chalcone synthases catalyze the first committed step of the flavonoid biosynthesis pathway, which produces different classes of metabolites in grape berry, including flavonols, flavan-3-ols and proanthocyanidins, and anthocyanins. In addition, products of the flavonoid 3'-hydroxylase can lead to the synthesis of cyanidin-3-glucoside, a red anthocyanin. The analysis of the gene space in the genome assembly showed that private Cabernet Sauvignon genes identified using Iso-Seq are only a fraction of the private Cabernet Sauvignon transcriptome. As with other transcriptome reconstruction methods, Iso-Seq can only identify transcripts that are expressed in the organs and developmental stages used for RNA sequencing. Obtaining the full set of private transcripts without genome assembly would require sequencing additional organs and developmental stages. In addition, it is challenging to differentiate isoforms derived from close paralogous genes, alleles of the same gene, and alternative splicing variants in any transcriptome obtained by RNA sequencing; this potentially leads to an overestimation of the number of genes in the final transcriptome reference. This study could not resolve isoform redundancy in the final transcriptome for about 37% of the gene loci in the Cabernet Sauvignon genome. This is a limitation of Iso-Seq, as well as of all transcriptome references, that cannot be overcome without a complete genome assembly. In this study, we tested whether the transcriptome reconstructed using Iso-Seq can be used for expression profiling.
Only an approximately 3% difference in read alignment between ISNT and the genome reference was observed, implying that at high coverage, ISNT detects almost all genes expressed during berry development. The slight difference in mapping rate between the two references can be explained by either the absence of some low-expression transcripts in the ISNT or the residual error rate in isoform sequences. Gene expression analysis using the ISNT as reference showed similar results compared to the Cabernet Sauvignon genome assembly, with a very high correlation of expression level and differential gene expression, and with similar global transcriptomic changes. However, we observed differences in the number of expressed and differentially expressed features that depend on the reference used.
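The agreement between expression levels measured on the two references can be illustrated with a short Python sketch that computes Pearson's correlation coefficient on log-transformed values for matched transcript and locus pairs. The function names, the log2(x + 1) transform, and the example values are assumptions made for illustration; they do not reproduce the normalization or the data of this study.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def log_expression(values: Sequence[float]) -> list:
    """Apply log2(x + 1) to dampen the influence of highly expressed genes."""
    return [math.log2(v + 1.0) for v in values]


# Hypothetical normalized counts for five matched transcript/locus pairs.
isnt_counts = [12.0, 340.0, 0.0, 58.0, 1020.0]
genome_counts = [15.0, 310.0, 1.0, 61.0, 980.0]
r = pearson(log_expression(isnt_counts), log_expression(genome_counts))
```

Values of `r` close to 1 indicate that the two references rank and scale expression similarly. The comparison is only meaningful for features that were paired between the two references beforehand.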