GO term enrichment analysis is another method by which the location of QTL can be validated

The presence of phosphate has a well-documented impact on the availability of both Fe and P to graminaceous plants, as they interact to form insoluble complexes in the soil. It is possible that the ability to take up iron under high phosphorus conditions has been unintentionally bred into the accessions in Group 2. Many of the plants treated with high iron also had low seed set due to a lack of panicles or the presence of immature panicles. In combination with the visible bronzing on these plants, this finding suggests that the HS was successful in inducing iron toxicity in many of the plants in the high iron condition. Iron deficiency, however, was not visible in the plants treated with low iron media. Though the different levels of iron exposure did not dramatically perturb morphological grouping by region, the morphological groups defined by DBSCAN did not correspond to the ionomic dataset. While variation in ionic homeostasis does appear to be loosely associated with region, the treatment regime to which the plants were exposed was more obviously represented in the clustering of the data. Interestingly, while the plants exposed to the high iron treatment were morphologically the most similar to the soil grown plants, they were the most ionomically dissimilar to the soil grown plants. Given that the plants in the low and control iron treatments were given relatively balanced nutrient media, the differences in morphology between the soil grown plants and their more branched counterparts in the control and low iron treatments are primarily due to the amount of nutrients available to the plant. Without the high iron treatment’s excess of iron to interfere with the uptake of other ions, typical ionomic homeostasis was largely maintained in the low iron and control treatments. Iron treatment alone did not significantly impact plant form, but the overall growth environment did dramatically impact the morphology of the plants.

When the morphology of all plants was compared on the basis of treatment using both PCA and PHATE, the plants treated with excess iron clustered closely with the soil grown plants. The other iron regimes in the HS produced highly tillered, highly productive plants; these treatments did not appear to impact the morphology within the AL regional group, illustrating the continued import of AL origin in the less severe treatment regimes. In contrast, the high iron treatment tended to produce plants that were relatively short and unbranched, leading to some overlap in morphology between the different regional groups defined by DBSCAN.

Quantitative trait mapping identifies regions of a genome, known as quantitative trait loci (QTL), that are associated with a phenotype of interest. This mapping process relies on data collection and extensive curation: the collection of data, the removal of both global and conditional outliers, and model selection are all necessary features of the protocol. All of these steps rely on the researcher’s best judgment, and are therefore subject to bias. QTL mapping requires significant effort and resources, and often serves as a starting point for even larger projects in plant breeding or fine-mapping alleles that contribute to phenotypes. It is therefore important to be confident in the quality of QTL mapping results and analyses. Currently, the quality of a QTL mapping experiment is assessed by cross validation (CV) and functional Gene Ontology (GO) term enrichment analysis. In CV, QTL detection is performed using a subset of the lines involved in the original experiment to assess the robustness of the results; though this method is a valuable tool, there remain concerns about the influence of the population structure on the effectiveness of CV. GO term enrichment analysis is designed to determine if the annotations of the genes in the identified regions are more frequently associated with the phenotype of interest than are the annotations of the genome as a whole. This approach relies on both accurate GO term annotation and the researcher’s understanding of the complex processes that may contribute to the phenotype. A third method for the assessment of QTL leverages our knowledge of genes that have previously been associated with the trait of interest either in the organism of interest or in its close relatives.

These genes can act as a sort of ‘sanity check’, as known genes of large effect should be found within the identified QTL. Single genes can underlie QTL, but multiple, related genes may also act in concert to control the placement of a QTL. Because the confidence intervals for QTL can be placed some distance away from the causative locus, it becomes necessary to determine if the number of known genes that are found within the confidence intervals of a QTL can be considered statistically significant. One strategy that can allow us to understand the likelihood of identifying a particular number of known genes, given the null hypothesis of random placement of QTL, is resampling with replacement (RWR). RWR analysis for this purpose begins with the random selection of regions of the genome that are equal in size to the identified QTL. In each repetition, one records the number of known genes that are found within the selected region. After a large number of repetitions, confidence intervals for the distribution of values produced by RWR can be calculated. These confidence intervals can be compared to the observed number of known genes found; if the observed number exceeds the calculated confidence limit, the QTL can be considered to have found a significant number of known genes. QTL RWR is sensitive to many different factors. Among these are the restrictions that are imposed on QTL placement, the treatment of closely linked genes, and the method chosen to handle the constraints imposed by the physical properties of the genome. Additionally, RWR assumes that the distribution of known genes reflects the true distribution of both known and unknown genes associated with the trait of interest. The present work aims to identify an optimal strategy for RWR analysis of QTL mapping studies. We ultimately propose a new method, Scanning Probabilistic QTL Validation (SPQV), that is designed to overcome the pitfalls associated with the simplest instance of RWR, specifically in its assumptions surrounding the gene distribution.
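The basic RWR procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function names (`count_known_genes`, `rwr_counts`, `upper_limit`), the toy chromosome, and the gene positions are all hypothetical, and the limit is a plain percentile rather than a corrected confidence interval.

```python
import random

def count_known_genes(gene_positions, start, end):
    """Return K, the number of known genes positioned within [start, end]."""
    return sum(1 for pos in gene_positions if start <= pos <= end)

def rwr_counts(chrom_lengths, known_genes, qtl_length, reps=1000, seed=None):
    """Draw `reps` random regions of length `qtl_length` and record K for each."""
    rng = random.Random(seed)
    names = list(chrom_lengths)
    # A chromosome of length C offers C - L origins for which [O, O+L] fits.
    weights = [max(chrom_lengths[name] - qtl_length, 0) for name in names]
    if not any(weights):
        raise ValueError("no chromosome is long enough to hold the region")
    counts = []
    for _ in range(reps):
        chrom = rng.choices(names, weights=weights)[0]
        origin = rng.randint(1, chrom_lengths[chrom] - qtl_length)
        genes = known_genes.get(chrom, [])
        counts.append(count_known_genes(genes, origin, origin + qtl_length))
    return counts

def upper_limit(counts, alpha=0.05):
    """Plain percentile limit (an assumption): the (1 - alpha) quantile of K."""
    ordered = sorted(counts)
    return ordered[min(len(ordered) - 1, int((1 - alpha) * len(ordered)))]

# Hypothetical toy genome: one chromosome of 100 bp, a known gene every 10 bp.
toy_lengths = {"chr1": 100}
toy_genes = {"chr1": list(range(10, 101, 10))}
toy_counts = rwr_counts(toy_lengths, toy_genes, qtl_length=20, reps=500, seed=1)
toy_limit = upper_limit(toy_counts)
```

An observed count of known genes would then be compared against `toy_limit`; only a count that exceeds the limit would be treated as significant under the null hypothesis of random placement.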

We discuss the assumptions made in SPQV versus RWR in the context of the reference genome of Setaria italica. Both methods are used to analyze the results of a simulated QTL mapping experiment. Finally, the SPQV is used to analyze the results of a previously published QTL mapping study.

RWR validation of a given QTL involves resampling the genome for randomly positioned regions of length L a large number of times, where L is the distance between the left and right confidence intervals of the QTL of interest. In the current work, all distances are measured in terms of base pairs. The statistic for each run is K, the number of known genes in the chosen region. In the simplest instance of RWR, the region of length L has no restrictions on location, other than that it must be placed fully on a chromosome. The random selection of the region of length L starts with the selection of an origin base pair, O. The QTL then extends outward from this point to cover the chromosomal interval [O, O+L]. Some complications arise at the tail ends of chromosomes. For example, if a chromosome has length C, then the region contained in [C–L, C] can never contain O and will therefore have a reduced likelihood of being included in the assessment. This ‘under representation’ of the tail ends of chromosomes is non-negligible due to the distribution of genes on individual chromosomes. Generally, the number of genes increases with increasing distance from the centromere, and drops precipitously at the relatively short telomere. There are several methods that can be used to handle the issue of under representation, the first of which can be termed the ‘bounceback method’. In the bounceback method, if O is selected such that the region of length L extends past the end of the chromosome, O is placed on the last base pair of the relevant chromosome, with the selected region extending to the point C–L. The bounceback method does rectify the under representation of [C–L, C], but results in over representation of the tail ends of chromosomes. A second method for addressing the issues associated with the assessment of the tail ends of chromosomes involves the directionality of QTL extension. In the case of unidirectional QTL extension, the origin O of the region of length L is chosen, and the selection for the RWR sample is then considered to be [O, O+L]. Unidirectional QTL extension’s most prominent fault is in its treatment of the chromosome as a string of text, rather than as a physical object. Mapping in this manner biases gene discovery, particularly for longer QTL, as genes towards the ‘tail’ end of the chromosome are less likely to be found within L. Bidirectional QTL extension reduces the under representation of the tail end of the chromosome by allowing O to fall into the region [C–L, C]. Unfortunately, a slightly less serious issue remains: on a chromosome of length C, the origin O continues to have a somewhat reduced likelihood of falling in the regions [1, 1+L] and [C–L, C] as compared to the region [1+L, C–L].
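The three placement strategies discussed above can be compared with a short Python sketch. The function names and the toy chromosome (C = 100, L = 10) are hypothetical, and the bidirectional rule shown here (choose a direction at random, then reject the draw if the region leaves the chromosome) is one possible reading of the description, not necessarily the rule used in this work.

```python
import random

def place_unidirectional(origin, length, chrom_len):
    """Region [O, O+L]; None if it would extend past the chromosome end."""
    end = origin + length
    return (origin, end) if end <= chrom_len else None

def place_bounceback(origin, length, chrom_len):
    """As unidirectional, but an overhanging region is reset to [C-L, C]."""
    if origin + length > chrom_len:
        return (chrom_len - length, chrom_len)
    return (origin, origin + length)

def place_bidirectional(origin, length, chrom_len, rng):
    """Extend left or right of O at random (assumed rule); None if it overhangs."""
    if rng.choice((-1, 1)) == 1:
        start, end = origin, origin + length
    else:
        start, end = origin - length, origin
    return (start, end) if start >= 1 and end <= chrom_len else None

def times_covered(placer, position, length, chrom_len):
    """Count the origins 1..C whose placed region contains `position`."""
    hits = 0
    for origin in range(1, chrom_len + 1):
        region = placer(origin, length, chrom_len)
        if region is not None and region[0] <= position <= region[1]:
            hits += 1
    return hits

# Hypothetical toy chromosome: C = 100, L = 10.
mid_uni = times_covered(place_unidirectional, 50, 10, 100)    # 11 origins
tail_uni = times_covered(place_unidirectional, 100, 10, 100)  # 1 origin
near_tail_bb = times_covered(place_bounceback, 95, 10, 100)   # 16 origins
example = place_bidirectional(95, 10, 100, random.Random(0))
```

In this toy case the last base pair is covered by a single origin under unidirectional extension, against 11 origins for a mid-chromosome position, while the bounceback rule covers position 95 from 16 origins. This reproduces the under and over representation of the tail described above.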

Selecting O at random is at the heart of RWR, but some restrictions on the location of O are necessary to reflect the actual process of QTL mapping. In general, QTL mapping starts with the identification of the marker that appears to correlate strongly with the trait of interest. The outer boundaries of the QTL are then defined as the region in which the causative locus lies with 95% certainty; these boundaries can then be extended to the closest markers used in the mapping process. In order to best reflect this process, the placement of O should be limited to the markers used in the original mapping experiment. This is particularly important because markers, like genes, are not evenly distributed through the genome, with decreasing density near centromeres and telomeres. Bidirectional QTL extension becomes more important in this context, as the genes located towards the boundary of an interval between markers have either disproportionately high or low likelihoods of being identified, depending on their placement. Because QTL mapping is performed on organisms with more than one chromosome, and because RWR relies upon the distribution of known genes, it is important to recognize the uneven dispersion of the known gene list. Duplication events often result in genes of similar function in tandem array, and genes in functional groups appear to cluster on a larger scale as well. If the QTL found in the original mapping experiment were required to span the entire distance between markers, any genes that are not separated by markers will function as one genetic unit; that is to say, it is impossible to identify one without identifying the others. Due to linkage events of this sort, it is possible to produce a confidence limit via RWR that can only be exceeded by the identification of one specific, disproportionately gene dense locus, even though a much more relaxed interval would have been produced with the use of a single extra marker that split the tandem array. Because of this, known genes that are not separated by markers should be treated as a single locus during RWR. Linkage should also be taken into account when comparing the results of an experiment to the confidence limit produced by RWR: if multiple genes contained between the same two markers were identified in the mapping experiment, they must be treated as a single locus. Once RWR is performed, confidence intervals are still complicated to calculate, because the distribution of genes found is not smooth and is not normally distributed. Because standard confidence intervals rely on the assumption of normality, they cannot be used with the distributions typically produced via RWR. The use of Bias Corrected and accelerated (BCa) confidence intervals is recommended, although these require fairly sophisticated mathematics to implement. RWR has the potential to be a powerful method for the assessment of QTL mapping, but its execution requires thoughtfulness on the part of the individual researcher.
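The treatment of linked genes as a single locus can be illustrated with a minimal Python sketch. The function names and the marker and gene positions are hypothetical, and the convention that a gene lying exactly on a marker belongs to the interval on its right is an assumption made only for this example.

```python
from bisect import bisect_right

def collapse_linked_genes(gene_positions, marker_positions):
    """Group the known genes on one chromosome by the marker interval they occupy.

    Genes that share an interval are not separated by any marker, so each
    returned group is treated as a single locus.
    """
    markers = sorted(marker_positions)
    loci = {}
    for pos in sorted(gene_positions):
        # Number of markers at or before the gene identifies its interval.
        interval = bisect_right(markers, pos)
        loci.setdefault(interval, []).append(pos)
    return list(loci.values())

def count_loci_in_region(loci, start, end):
    """Count the loci with at least one member gene inside [start, end]."""
    return sum(1 for genes in loci if any(start <= g <= end for g in genes))

# Hypothetical example: a tandem array of three genes between two markers.
toy_markers = [100, 200, 300]
toy_gene_positions = [110, 120, 150, 250, 400]
toy_loci = collapse_linked_genes(toy_gene_positions, toy_markers)
toy_k = count_loci_in_region(toy_loci, 100, 200)
```

Here the three genes between the markers at 100 and 200 collapse into one locus, so a sampled region spanning that interval contributes a count of 1 rather than 3. The same collapsing would be applied to the genes identified in the mapping experiment before comparing against the RWR limit.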