High quality phenotypic data is essential for producing accurate genomic predictions


Genomic selection , first introduced by Meuwissen et al. , has been adopted by alfalfa breeding programs in recent years for improvement of a variety of complex traits . By enabling breeders to perform marker-only selection, a cycle of selection could be completed in less than a year, addressing one of the biggest impediments to faster genetic gain – the need for multi-year evaluations . The majority ofGS applications in alfalfa to date are based on phenotypic and genotypic data at the individual plant level . However, alfalfa is marketed as synthetic cultivars with breeding based on the evaluation of families, not individual plants. This is particularly relevant when attempting to evaluate yield. An alternative approach to GS is pooling individuals and genotyping family bulks rather than individuals . Phenotypic data can then be collected at the family level from densely planted plots, a better representation of commercial yield, and the relationship matrix is built based on allele frequency marker data, rather than individual genotyping calls. High quality phenotypic data is essential for maximizing the predictive ability of a GS model . Field trials are inherently variable, particularly in the case of perennial crops with multiple harvests where spatial and temporal differences can have a large impact the quality of phenotypic data. Randomization, blocking, and complex trial designs are commonly used to account for field variation; however, they are often inadequate to account for all the spatial trends in large experiments . Residual maximum likelihood has become commonplace in plant breeding to estimate variance components and calculate genetic parameters using linear mixed models .

Best linear unbiased predictions of random effects have proven effective in predicting breeding values to guide breeding decisions . Mixed-model analysis has the advantage of handling unbalanced data and can incorporate additional information using covariance matrices, square flower bucket providing more accurate predictions . In field trials, experimental units in close proximity to one-another are expected to be more highly correlated than those far apart, and successive harvests likely will be more highly correlated than harvests separated by a greater time interval. Therefore, a better way of managing spatial and temporal variation is to include covariance matrices in the analysis to account for spatial and temporal trends in the data .The training populations used in this study were part of the UC Davis elite non-dormant alfalfa breeding program. Plants were selected from an evaluation trial in Davis, CA and crossed in the greenhouse in 2018 resulting in two populations – UCAL1960 and UCAL1970. In 2019, seed from both populations was germinated in the greenhouse and transplanted to the field at the UC Davis Plant Sciences Farm. Approximately 200 plants from each population were grown in adjacent areas that were each placed under mesh cages to enable pollination with leaf cutter bees . The seed from each plant was harvested individually and the highest seed yielding families of each population were included in the trial. The entries consisted of 105 half-sib families from UCAL1960, 88 half-sib families from UCAL1970, balanced bulks from each population, and three cultivar checks: Highline, UC Impalo and CUF 101.A field trial was established at two locations, one in Yolo County and the other in Solano County , on the UC Davis Plant Sciences Farm near Davis, CA in May 2020. Although the two sites are close together, the Solano location is in a floodplain of the nearby Putah Creek. The locations are in a Mediterranean environment characterized by hot, dry summers, cool winters, and moderate annual rainfall, which falls predominantly in the cooler months from November to March . The soil type is a Yolo silt clay loam . Soil tests were conducted prior to planting to adjust P, K, and pH according to soil test recommendations. Seeds were sown in trays in the greenhouse in March 2020 and seedlings transplanted in early May. The trial at each location was a randomized complete block design consisting of two replicates, each with 7 rows and 29 ranges. Plots consisted of 24 plants laid out in a regular 4 × 6 grid with 20 cm spaces between plants.

There was a 30 cm space between rows and a 110 cm space between ranges to allow room for mechanical harvesting. This trial was managed as a high-yielding alfalfa stand; soil tests were conducted each year, with amendments made accordingly. The trial was initially irrigated using sprinklers, but after plants were well established, flood irrigation was used to match crop evapotranspiration. Weeds were managed by a combination of manual removal and herbicides, and insect pests were monitored and controlled with insecticide as needed.Dry matter yield was measured by mechanical harvest when the plants had reached 10% bloom. A self-propelled forage plot harvester with a flail chopper and weigh bin was used to cut plots uniformly at 7.5 cm and measure the fresh weight. There was a total of 12 harvests, three in the establishment year, seven in the first full year of production , and two in the second production year . Each harvest, 20 hand grab samples were collected throughout the day of harvest and weighed before being dried at 60°C in a forced-air drying oven for 7 days. Dry samples were then weighed and used to calculate an average dry matter percentage.The first fully expanded trifoliolate leaf from the growing tip of the plant was collected from every plant within each family from a single replication of the trial and pooled for DNA extraction. Subsequently, tissue samples were lyophilized and ground. DNA was extracted from the ground tissue using DNeasy 96 plant kits quantified using a BioTek Synergy H1 Microplate Reader with Take3 Micro-volume plate. Genotyping-by-sequencing libraries were constructed using the protocol of Elshire et al. modified as described previously in Li et al. . Briefly, 100 ng of DNA from each sample was digested with PstI enzyme. Two different types of adapters, a barcode adapter and common adapter, were ligated with the DNA fragment using T4 DNA ligase. Equal volumes of the ligation product from each sample were pooled together and purified using AMPure XP beads . The ligation product was then amplified with the Phusion High-Fidelity PCR Master mix , using 50 ng of DNA as the template and 25 nmole of each PCR primer in a 50 ul reaction system. PCR was performed under the following thermal cycling conditions: 72°C for 5 min, denaturation at 98°C for 30 sec, followed by 12 cycles with 10 sec of denaturation at 98°C, 30 sec of primer annealing at 65°C and 30 sec extension at 72°C. The resulting library was again purified using the AMPure XP beads . The DNA fragment sizes were checked using a Bioanalyzer then all families were pooled and sequenced with Illumina Nextseq. The raw reads were compressed using clumpify software from the BBTools package , then trimmed using Cutadapt . The Tassel 5.0 GBS v2 pipeline was used for SNP discovery .

Default parameters were used for all steps unless otherwise stated. First, tags and taxa are identified from the raw FastQ files and stored in a local database using GBSSeqToTagDBPlugin. Next, black flower bucket distinct tags from the database were retrieved and reformatted to be aligned with the reference genome using the Tag Export To Fastq Plugin. Bowtie2 v2.5.1 was used to align the filtered reads with the Medicago sativa refence genome developed by Shen et al. using the –very-sensitive-local input. The resulting SAM file created from this step was then used to update the local database with the SAMToGBSdbPlugin. DiscoverySNPCallerPluginV2 then identified SNPs from the aligned tags and updated the database. Finally, ProductionSNPCallerPluginV2 was used to generate a VCF formatted genotype file. SNPs were further filtered with VCFTools using the following criteria: biallelic variants only, minor allele frequency > 0.05, minimum sequencing depth for each data point > 4, minimum mean read depth over all samples > 30, and maximum missing data of 10%. After filtering, 39,229 SNPs remained for downstream analysis and genomic prediction.Only the seven harvests in the first full year of production from Solano County were chosen to build the final genomic prediction model. The significant spatial variation and low heritability from the Yolo County location resulted in lower prediction accuracies when included. Forage yield in the establishment year typically does not represent production in subsequent years, so the first three harvests were excluded. Further, as stands age, plant mortality due to diseases and other biotic sources and due to abiotic causes occurs, such that biomass yield measurements do not reflect yield potential as much as resistance or tolerance to these stressors. In an attempt to assess true yield performance rather than resistance to biotic and abiotic, we only used the seven harvests in 2021 for the model . Predictive ability varied across the range of individual harvests with harvest three in May 2021 having the highest predictive ability of 0.27 and harvest four in June 2021 having the lowest at 0.04. The combination of all harvests in 2021 resulted in a predictive ability of 0.30, consistent with Andrade et al. , who utilized similar methods of family bulks for non-dormant alfalfa populations in the south-eastern united states.Alfalfa has suffered from a lack of yield improvement over the last 30 years of breeding and research. Traditional recurrent phenotypic selection, although useful for the improvement of simply inherited traits, has been unsuccessful in translating yield gains to commercial producers. In this study, genomic selection was incorporated into the UC Davis alfalfa breeding program across two elite non-dormant populations in an effort to select families for improved yield potential. Whereas previous predictive models in alfalfa have been developed based on genotyping single plants and phenotypic information from space plants or short family rows, this study used bulk genotyping of families, as demonstrated by Andrade et al. , and uses phenotypic data on densely transplanted mini-sward plots. The goal was to better capture yield that could be expected in a commercial planting, while reducing genotyping costs, and aligning the genotyping with current breeding methodology in alfalfa, which is also performed on a family level. To further improve the quality of phenotypic data, spatial and temporal variation were modeled using variance/covariance structures. The predictive ability of GS models is affected by a range of factors, incluing the number, density and distribution of genetic markers, population size and structure, the degree of LD, and the quality of phenotypic data. The LD decay in this study was moderate , but our marker density was sufficient to provide coverage of the genome. In a study optimizing whole-genome sequencing in autotetraploid blueberry , the predictive ability of genomic selection increased rapidly with the inclusion of more markers; however, a plateau was reached with 10,000 markers or more. GBS therefore should provide sufficient markers for accurate GS in alfalfa, as we noted previously . Population structure can induce bias in genomic predictions if not incorporated into the prediction model. Organized breeding programs can develop significant population structure, as observed in this study, where the two populations included in the training population were related, yet clearly distinct. Here we accounted for population structure by including the first two principal components from a PCA in the model. Population size is perhaps the most important factor influencing the predictive ability of GS. A significant increase in predictive ability for fruit yield in blueberry occurred by increasing population size from 120 to ~1500, with an ideal size of 1000 or more . We evaluated relatively few families in this study which may have impacted the predictive ability of our model, even though it represented a maximum number of families we could reasonably harvest and manage. High throughput phenotyping could be a solution to greatly increase the number of families evaluated for DMY in alfalfa without a significant increase in cost to the breeder . Remotely assessing biomass from drone-based cameras and harvesting a subset of plots would allow a breeder to evaluate significantly more families with a similar cost to current phenotyping methods. The correlation between harvests was significant and positive in this study, and spatial variation is inherent in field trials. Accounting for spatial variation and temporal correlation between repeated measures improves the quality of phenotypic data, thus allowing for more accurate predictions which may translate to genetic gain.