New plant databases introduced in each version of PMN are Tier 3 BioCyc databases, a designation indicating that their information is based mostly on automated prediction from the genome. Any experimentally supported enzymes and pathways in MetaCyc or PlantCyc that are annotated as belonging to the organism are also imported into the database, along with their citations and codes for the type of evidence the cited papers present. The plant’s remaining complement of enzymes is predicted, and its metabolites and pathways are in turn predicted based on the enzymes. Bringing a new species or subspecies into PMN begins with a sequenced and annotated genome with predicted protein sequences. To be considered for inclusion, a genome must pass a quality check with BUSCO, which assesses genome completeness using a database of proteins expected to be present in all eukaryotes, with matches assessed using HMMER. A score of at least 75% “complete” is required for inclusion in PMN. If a genome passes this metric, it can then be run through the PGDB creation pipeline. First, splice variants are removed, leaving one protein sequence per gene, with the longest variant being retained. The sequences are then classified as enzymes or non-enzymes, and enzymatic functions are predicted, using the Ensemble Enzyme Prediction Pipeline (E2P2) software.
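The genome-quality gate and the splice-variant filtering described above can be illustrated with a minimal Python sketch. This is not the PMN pipeline code: the function names, the `(gene_id, variant_id, sequence)` tuple layout, and the tie-breaking rule (the first variant encountered wins when lengths are equal) are assumptions made for illustration. Only the 75% threshold and the longest-variant rule come from the text.

```python
# Illustrative sketch only; names and data layout are hypothetical.

BUSCO_MIN_COMPLETE = 75.0  # minimum percent "complete" stated in the text


def passes_busco(complete_percent):
    """Return True if a genome's BUSCO 'complete' score meets the threshold."""
    return complete_percent >= BUSCO_MIN_COMPLETE


def keep_longest_variant(proteins):
    """Keep one protein per gene, retaining the longest splice variant.

    proteins: iterable of (gene_id, variant_id, sequence) tuples.
    Returns a dict mapping gene_id -> (variant_id, sequence).
    Assumption: on a length tie, the first variant seen is kept.
    """
    best = {}
    for gene_id, variant_id, sequence in proteins:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (variant_id, sequence)
    return best
```

For example, a gene with variants of 4 and 6 residues would be represented by the 6-residue variant, and a genome scoring 74.9% complete would be excluded.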
E2P2 uses BLAST and PRIAM to assign enzyme function by sequence similarity to proteins with previously known enzymatic functions, drawing on functional annotations from several sources including MetaCyc, SwissProt, and BRENDA. The genomes included in PMN 15 were checked using BUSCO v3.0.2 with the Eukaryota ODB9 dataset. Enzyme prediction for PMN 15 was done using E2P2 v4.0 and RPSD v4.2, which was generated using data from PlantCyc 12.5, MetaCyc 21.5, BRENDA, SwissProt, TAIR, Gene Ontology, and Expasy. Once enzymes are predicted, they are assembled into pathways by the PathoLogic function of Pathway Tools. The set of predicted pathways is then further refined using the Semi-Automated Validation Infrastructure (SAVI) software. SAVI is used to automatically apply broad curation decisions to the pathways predicted for each species. It can be used, for example, to specify particular pathways that are universal among plants and should therefore be included in all species’ databases even if not predicted by PathoLogic. SAVI can also be used to specify that a particular pathway is known to be present only within a specific plant clade; if the pathway is predicted in a species outside of that clade, the prediction is considered false and is removed. PMN 15 was generated using Pathway Tools 24.0 and SAVI 3.1. The final parts of the pipeline are grouped into three stages: refine-a, refine-b, and refine-c.
In refine-a, the database changes recommended by SAVI are applied to the database, and pathways added or approved by SAVI have SAVI citations added. In refine-b, pathways and enzymes with experimental evidence of presence in a plant species are added to that PGDB if they were not predicted, and appropriate experimental evidence codes are added. In refine-c, authorship information is added to the PGDB, the cellular overview is generated, and various automated data consistency checks are run. The convention for PGDB versions was updated in PMN 15. Taking SorghumbicolorCyc 7.0.1 as an example, the first number, 7, is incremented when the PGDB is regenerated de novo from a new version of MetaCyc and/or a new genome assembly. The second, 0, is incremented when there are error corrections or other fixes to the content of the database. A third, 1 in the example, may be added when the database is converted to a new version of Pathway Tools without being regenerated, a process that does not alter the database contents.
Since the initial 1.0 release of PMN, some changes in curation policy have been made to PMN and PlantCyc. In 2013, the Arabidopsis-specific database, AraCyc, switched from identifying proteins by locus ID to using the gene model ID. This eliminates ambiguity when multiple splice variants exist for a single locus. In PMN 10, the policy for all species was switched from using the first splice variant to using the longest. This was done because a longer splice variant is likely to have more domains, making it easier to determine its function. Also in PMN 10, the database narrowed its focus strictly to small-molecule metabolism, and pathways involved solely in macromolecule metabolism were removed.
Macromolecules have never been the focus of PMN, and provision of information about them is a role better served by other databases with tools specifically suited to large heteropolymers such as proteins and DNA/RNA. In version 13 of PMN, the PlantCyc database was limited to include only pathways and enzymes with experimental evidence to support them. The original purpose of including all information, experimental and computational, in PlantCyc was to allow cross-species comparison, a function now served by the virtual data integration and display functionality recently introduced in Pathway Tools. PlantCyc now serves as a repository of all experimentally supported compounds, reactions, and pathways for plants.
One hundred and twenty PMN pathways were randomly selected to manually assess pathway prediction accuracy. The 126 organism-specific PGDBs were then regenerated using E2P2 and PathoLogic alone: PathoLogic was set to ignore the expected phylogenetic range of each pathway and to call pathway presence or absence based only on the presence of enzymes, SAVI was not applied, and the step of importing pathways with experimental evidence for a species into that species’ database when the pathway was not predicted was skipped. This resulted in a set of PGDBs based purely on computational prediction that we refer to as “naïve prediction PGDBs”. Biocurators evaluated the accuracy of the prediction of each of the 120 pathways across all 126 organisms in PMN in the naïve prediction PGDBs and, separately, in the released version of PMN. Specifically, we evaluated whether pathway assignments to the PGDBs reflected the taxonomic range of the pathway as expected from the literature.
Each pathway’s assignment to the naïve prediction PGDBs and released PGDBs was classified with respect to the expected taxonomic range as either “Expected”, “Broader”, or “Narrower”, or it was identified as a non-plant or non-algal pathway and therefore classified as a non-PMN pathway.
In order to analyze the pathways, reactions, and compounds (PRCs) in each species’ database, presence-absence matrices were generated for each of these three data types. Each is a binary matrix with the list of PMN organisms as its rows and a list of PRCs of one type as its columns. Each matrix element is equal to 1 if the organism contains the PRC and 0 if it does not. Reactions were only marked as present in a species if the species had at least one enzyme annotated to the reaction, whether predicted or from experimental evidence. Since PRCs that are present in either only one organism or all organisms are not useful in differentiating plant groups, we excluded these PRCs from further analysis. Separately, a table was generated that maps each species to one of several pre-defined taxonomic groups. The groups were selected manually to best represent the diversity of species in PMN and included monophyletic and paraphyletic groups, as well as a polyphyletic “catch-all” group. The PRC matrices and the plant group table were used to investigate relationships among the species through the lens of metabolism.
We downloaded and integrated datasets from 5 existing Arabidopsis root single-cell RNA-seq studies. Briefly, raw fastq files for 21 datasets derived from these studies were downloaded, trimmed, and mapped using the STARsolo tool v.2.7.1a. Whitelists for each dataset were obtained either from the 10X Cellranger software tool v.2.0 for the 10X-Chromium samples, or, for the Drop-seq samples, by following the Drop-seq computational pipeline and extracting error-corrected barcodes from the final output.
Valid cells within the digital gene expression matrices computed by STARsolo were then determined as those having total unique molecular identifier (UMI) counts greater than 10% of the 1st percentile cell, after filtering for cells with very high UMI counts. Cells containing greater than 10% mammalian reads, greater than 10% organellar reads, or transcripts from fewer than 200 genes were filtered out. Filtered digital gene expression matrices were then preprocessed with the Seurat package, after removing protoplast-inducible genes, using the SCTransform method. All Seurat objects were then integrated using a previously described approach, applying the SelectIntegrationFeatures, PrepSCTIntegration, FindIntegrationAnchors, and IntegrateData functions from the Seurat R package, using 5000 variable features, 20 principal components, and otherwise default parameters. Cell clusters were computed using the Seurat functions FindNeighbors and FindClusters, with 20 principal components and a resolution parameter of 0.8. Index of Cell Identity (ICI) scores were computed using a combination of existing ATH1 microarray and RNA-seq single-cell datasets. Briefly, arrays were normalized using the gcrma R package, and RNA-seq data were trimmed using the bbduk tool and mapped using bbmap. Transcript counts were quantified using the featureCounts tool. Raw RNA-seq counts were then normalized using the edgeR package with the “upperquartile” method. Normalized reads were then further normalized with the gcrma-normalized microarray data using the Feature-Specific Quantile Normalization method to obtain a dataset consisting of both RNA-seq and microarray-based cell-type-specific transcriptome measurements. This dataset was then used to build an ICI specification matrix using previously described methods. This specification table was then used to compute ICI scores for each cell in the integrated single-cell dataset, along with p-values derived from random permutation.
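The per-cell quality filters described earlier in this paragraph (more than 10% mammalian reads, more than 10% organellar reads, or fewer than 200 detected genes) can be expressed as a small Python sketch. It is an illustration under stated assumptions, not the code used in the study: the function name and the dictionary-based input are hypothetical, and fractions are computed from per-gene counts rather than from raw reads. The UMI-based valid-cell threshold is not reproduced here.

```python
# Illustrative sketch only; names and input layout are hypothetical.

def passes_qc(cell_counts, mammalian_genes, organellar_genes,
              max_fraction=0.10, min_genes=200):
    """Apply the three per-cell filters to a single cell.

    cell_counts: dict mapping gene_id -> count for one cell.
    mammalian_genes, organellar_genes: sets of gene_ids in each category.
    Assumption: fractions are computed from counts, not raw reads.
    """
    total = sum(cell_counts.values())
    detected = sum(1 for count in cell_counts.values() if count > 0)
    # Cells with too few detected genes (or no counts at all) are removed.
    if total == 0 or detected < min_genes:
        return False
    mammalian = sum(c for g, c in cell_counts.items() if g in mammalian_genes)
    organellar = sum(c for g, c in cell_counts.items() if g in organellar_genes)
    # A cell is removed if either fraction exceeds the threshold.
    return (mammalian / total <= max_fraction
            and organellar / total <= max_fraction)
```

Applying `passes_qc` to every cell barcode and keeping those that return `True` would yield the filtered matrix that is passed on to normalization.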
To map the single-cell data to metabolic domains, pathways, and enzymes, we used AraCyc v.17.0, which includes 8556 metabolic genes and 650 pathways. We used the pathway-metabolic domain mapping file version 2.0 to map the pathways to 13 metabolic domains. To avoid biases introduced by small sample size in the cell type specificity analysis, we only included pathways containing at least 10 genes whose transcripts were detected in the single-cell data described above. Based on these criteria, 198 out of 650 pathways were included in this analysis. To compute cell type specificity at the transcript level, we first calculated the expression level of a pathway or domain per cell type by taking the average of the expression values of all genes annotated to this pathway or domain within this cell type. A pathway or domain was defined as specific to a cell type if its expression level in that cell type was at least 1.5-fold higher than its background expression, which was calculated by taking the average of the expression values of all genes annotated to this pathway or domain in all cells. Since the expression level of a pathway or domain per cell type could be influenced by gene expression outliers, we only included the cell types in which more than 50% of the genes associated with the pathway or domain showed higher expression than their background expression, based on a Wilcoxon test followed by a multiple hypothesis test adjustment using FDR with a threshold of 0.01. The background expression level of a gene was calculated by taking the average of its expression values in all the cells included in this study. Heatmaps were generated using the R package ggplot2 v.3.3.4. To compute cell type specificity at the pathway level, we first selected the set of pathways containing at least 10 genes whose transcripts were captured by the single-cell transcriptomic data, to avoid biases that could be introduced by small sample size.
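The 1.5-fold specificity rule described above can be sketched in Python as follows. This is a simplified illustration, not the analysis code: the function name and input layout are hypothetical, and the additional per-gene Wilcoxon test with FDR adjustment is deliberately omitted, so the sketch covers only the fold-change criterion.

```python
# Illustrative sketch only; names and input layout are hypothetical.
from statistics import mean


def specific_cell_types(expr, cell_types, pathway_genes, fold=1.5):
    """Return cell types in which a pathway meets the fold-change criterion.

    expr: dict mapping gene_id -> list of expression values, one per cell.
    cell_types: list of cell type labels, one per cell (same order as expr).
    pathway_genes: gene_ids annotated to the pathway or domain.
    Note: the Wilcoxon/FDR gene-level filter from the text is not applied.
    """
    genes = [g for g in pathway_genes if g in expr]
    if not genes:
        return []
    # Background: average over all annotated genes in all cells.
    background = mean(value for g in genes for value in expr[g])
    if background <= 0:
        return []
    selected = []
    for cell_type in sorted(set(cell_types)):
        cells = [i for i, label in enumerate(cell_types) if label == cell_type]
        # Pathway level in this cell type: average over its genes and cells.
        level = mean(expr[g][i] for g in genes for i in cells)
        if level >= fold * background:
            selected.append(cell_type)
    return selected
```

A pathway whose genes average 5 in one cell type against a background of 3 across all cells would be flagged for that cell type under this criterion alone, since 5 is at least 1.5 times 3.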
Based on this requirement of at least 10 detected genes per pathway, 30% of Arabidopsis pathways were included in this analysis.
Fruit flavor is an elusive trait, influenced by many factors including genetics, environment, and cultural practices. Breeders are increasingly focused on meeting the needs of consumers, but genetic improvement of flavor is challenging as a consequence of the chemical and genetic complexities of the flavor phenotype. These challenges are accentuated in heterozygous, polyploid species. For example, fewer significant single nucleotide polymorphisms were detected in a genome-wide association study of tetraploid blueberry when diploid models were applied; in octoploid strawberry, structural variation underlying a locus affecting volatile production was difficult to resolve using a single reference genome. Recent advances have been made via chemical–sensory studies to identify specific volatiles associated with consumer preference. Although important volatile compounds in fruit crops are being identified, too little is known about the metabolomic and genetic diversity within species and breeding populations.