Each dataset was mapped to the predicted gene set using CLC Genomics


The MAKER predictions formed the basis of the first official gene set. To improve the structural and functional annotation of genes, these gene predictions were manually and collaboratively edited using the interactive curation software Apollo. For a given gene family, known insect genes were obtained from model species, especially T. castaneum and D. melanogaster, and the nucleotide or amino acid sequences were used in BLASTx or tBLASTn to search the L. decemlineata OGS v0.5.3 or genome assembly, respectively, on the i5k Workspace@NAL. All available evidence, including additional RNAseq data not used in the MAKER predictions, was used to inspect and modify gene predictions. Changes were tracked to ensure quality control. Gene models were inspected for quality, incorrect splits and merges, internal stop codons, and gff3 formatting errors, and finally merged with the MAKER-predicted gene set to produce the official gene set. For focal gene families, details on how genes were identified and assigned names based on functional predictions or evolutionary relationship to known reference genes are provided in the Supplementary Material.

To assess the quality of our genome assemblies, we first used BUSCO v2.0 to determine the completeness of each genome assembly and the official gene set, separately. We benchmarked our data against 35 insect species in the Endopterygota odb9 database, which consists of 2,442 single-copy orthologs. Secondly, we annotated and examined the genomic architecture of the Hox and Iroquois Complex gene clusters. For this, tBLASTn searches were performed against the genome using orthologous Hox gene protein sequences from T. castaneum and A. glabripennis.

Provisional L. decemlineata models were refined, and potential gene duplications were identified, via iterative and reciprocal BLAST and by manual inspection and correction of protein alignments generated with ClustalW2, using RNAseq expression evidence when available. Finally, we used BlobTools v1.0 to assess the assembled genome for possible contamination by generating a Blobplot. This plot integrates guanine and cytosine content of sequences, read coverage, and sequence similarity via BLAST searches to assess genome contamination. Putative contaminants were identified using the combined results of BLASTn against NCBI’s nt database and DIAMOND BLAST against the UniProt reference database, and read coverage was assessed by mapping the 100 bp paired-end reads from the 180 bp insert library to the ALLPATHS genome with the Burrows-Wheeler Aligner version 0.7.5a. As these sequences are nested within the genome scaffolds, and could represent horizontally transferred DNA, we left the scaffolds intact and provide a supplementary dataset of the contaminant sequences.

In order to identify rapidly evolving gene families along the L. decemlineata lineage, we obtained ~38,000 ortho-groups from 72 Arthropod species as part of the i5k pilot project from OrthoDB version 8. For each ortho-group, we took only those genes present in the order Coleoptera, which was represented by the following six species: A. glabripennis, A. planipennis, D. ponderosae, L. decemlineata, O. taurus, and T. castaneum. Finally, in order to make accurate inferences of ancestral states, families that were present in only one of the six species were removed. This resulted in a final count of 11,598 gene families that, among these six species, form the comparative framework that allowed us to examine rapidly evolving gene families in the L. decemlineata lineage. Aside from the gene family count data, an ultrametric phylogeny is also required to estimate gene gain and loss rates.
To make the tree, we considered only gene families that were single copy in all six species and that had another arthropod species also represented with a single copy as an outgroup. Outgroup species were ranked based on the number of families in which they were also single copy along with the coleopteran species, and the highest ranking outgroup available was chosen for each family. For instance, Pediculus humanus was the most common outgroup species. For any gene family, we chose P. humanus as the outgroup if it was also single copy. If it was not, we chose the next highest ranking species as the outgroup for that family.
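The outgroup selection rule described above can be illustrated with a minimal Python sketch. It assumes the gene family data are available as a mapping from family ID to per-species copy counts; the function names, the species abbreviations, and the alphabetical tie-break are hypothetical choices made for illustration and are not taken from the original pipeline.

```python
from collections import Counter


def is_single_copy(copies, species):
    """True if every listed species has exactly one gene in this family."""
    return all(copies.get(sp, 0) == 1 for sp in species)


def rank_outgroups(families, ingroup, candidates):
    """Rank candidate outgroups by the number of families in which they are
    single copy alongside a fully single-copy ingroup (highest first)."""
    counts = Counter()
    for copies in families.values():
        if is_single_copy(copies, ingroup):
            for sp in candidates:
                if copies.get(sp, 0) == 1:
                    counts[sp] += 1
    # Ties are broken alphabetically (an assumption, for determinism only).
    ranked = [sp for sp in candidates if counts[sp] > 0]
    return sorted(ranked, key=lambda sp: (-counts[sp], sp))


def choose_outgroup(copies, ingroup, ranking):
    """Return the highest-ranking outgroup that is single copy in this family,
    or None if the family is unusable for the species tree."""
    if not is_single_copy(copies, ingroup):
        return None
    for sp in ranking:
        if copies.get(sp, 0) == 1:
            return sp
    return None
```

A family for which `choose_outgroup` returns `None` would simply be excluded from the set of single-copy orthologs used for tree building.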

This process resulted in 3,932 single copy orthologs that we subsequently aligned with PASTA. We used RAxML with the PROTGAMMAJTTF model to make gene trees from the alignments and ASTRAL to make the species tree. ASTRAL does not give branch lengths on its trees, a necessity for gene family analysis, so the species tree was again given to RAxML along with a concatenated alignment of all one-to-one orthologs for branch length estimation. Finally, to generate an ultrametric species tree with branch lengths in millions of years we used the software r8s, with a calibration range based on age estimates of a crown Coleopteran fossil at 208.5–411 million years. This calibration point itself was estimated in a similar fashion in a larger phylogenetic analysis of all 72 Arthropod species. With the gene family data and ultrametric phylogeny as the input data, gene gain and loss rates (λ) were estimated with CAFE v3.0. CAFE is able to estimate the amount of assembly and annotation error (ε) present in the input data using a distribution across the observed gene family counts and a pseudo-likelihood search, and then is able to correct for this error and obtain a more accurate estimate of λ. Our analysis had an ε value of about 0.02, which implies that 3% of gene families have observed counts that are not equal to their true counts. After correcting for this error rate, λ = 0.0010 is on par with values previously found for other Arthropod orders. Using the estimated λ value, CAFE infers ancestral gene counts and calculates p-values across the tree for each family to assess the significance of any gene family changes along a given branch. Those branches with low p-values are considered rapidly evolving.

RNAseq analyses were conducted to establish male-, female-, and larva-enriched gene sets and to identify specific genes that are enriched within the digestive tract compared to the entire larva. RNAseq datasets were trimmed with CLC Genomics v.9 and quality was assessed with FastQC.
Reads were mapped with >90% similarity over 60% of length, with two mismatches allowed. The number of reads was corrected to reads per million mapped to allow comparison between RNAseq datasets that have varying coverage. Transcripts per million (TPM) was used as a proxy for gene expression, and fold changes were determined as the TPM in one sample relative to the TPM of another dataset.
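The per-million scaling and fold-change calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CLC Genomics implementation: the function names are hypothetical, no transcript-length normalization is applied because none is described in the text, and the optional pseudocount used to avoid division by zero is an assumption of the sketch.

```python
def per_million(counts):
    """Scale raw mapped-read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads in this dataset")
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def fold_change(expr_a, expr_b, pseudocount=1.0):
    """Expression in sample A relative to sample B for genes present in both.

    The pseudocount (an assumption of this sketch) keeps the ratio defined
    when a gene has zero expression in sample B.
    """
    shared = expr_a.keys() & expr_b.keys()
    return {
        gene: (expr_a[gene] + pseudocount) / (expr_b[gene] + pseudocount)
        for gene in shared
    }
```

Scaling each dataset before taking the ratio means that a library sequenced twice as deeply does not appear to have twice the expression of every gene.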

Baggerly’s test followed by Bonferroni correction was used to identify genes with significant enrichment in a specific sample. Statistical values for Bonferroni correction were reported as the number of genes × α value. This stringent statistical analysis was used as only a single replicate was available for each treatment. Enriched genes were removed, and mapping and expression analyses were repeated to ensure that genes with low expression were not missed. Genes were identified by BLASTx searching against the NCBI non-redundant protein databases for arthropods with an expectation value <0.001.

We investigated the identities, diversity, and genomic distribution of active transposable elements (TEs) within L. decemlineata in order to understand their contribution to genome structure and to determine their potential positional effect on genes of interest. To identify TEs and analyze their distribution within the genome, we developed three repeat databases using RepeatMasker, which uses the library of repeats within Repbase; RepeatModeler, which identifies de novo repeat elements; and literature searches to identify beetle transposons that were not found within Repbase. The three databases were used within RepeatMasker to determine the overall TE content in the genome. To eliminate false positives and examine the genome neighborhood surrounding active TEs, all TE candidate models were translated in 6 frames and scanned for protein domains from the Pfam and CDD databases. The protein domain annotations were manually curated to remove clear false positives and old, highly degraded copies of TEs without identifiable coding potential, and to correct the annotation when improper labels were given.
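For readers unfamiliar with the six-frame translation applied to the TE candidate models, the following Python sketch shows the idea. It assumes the standard genetic code and plain A/C/G/T input; the function names are hypothetical, codons containing any other character are translated as "X", and the subsequent Pfam/CDD domain scan is outside the scope of the sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return translations for frames +1..+3 and -1..-3, keyed by frame."""
    frames = {}
    strands = (("+", seq.upper()), ("-", reverse_complement(seq)))
    for strand, strand_seq in strands:
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Each of the six protein strings would then be passed to a domain search, so that a coding region is detected regardless of the strand or reading frame in which the TE was inserted.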
The TE models that contained protein domains were mapped onto the genome and used for our neighborhood analysis: we extracted the 1 kb flanking regions for each gene and scanned these regions for TEs with intact protein coding domains.

Population genetic diversity of pooled RNAseq samples was used to examine the genetic structure of pest populations and past population demography. For Wisconsin, Michigan, and the lab strains from New Jersey, we aligned the RNAseq data to the genomic scaffolds, using Bowtie2 version 2.1.0 to index the genome and generate aligned SAM files. We used BWA to align the RNAseq data from the three populations from Europe. SAMtools/BCFtools version 0.1.19 was used to produce BAM and VCF files. All calls were filtered with VCFtools version 0.1.11 using a minimum quality score of 30 and minimum depth of 10. All indels were removed from this study. Population-specific VCF files were sorted and merged using VCFtools, and the allele counts were extracted for each SNP. These allele frequency data were then used to infer population splits and relative rates of genetic drift using Treemix version 1.12.
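The variant filtering criteria above (quality of at least 30, depth of at least 10, indels removed) can be expressed as a small Python sketch operating on VCF text lines. This is not the VCFtools implementation: the function names are hypothetical, and reading depth from the site-level `DP` key of the INFO column is an assumption of the sketch, since the text does not say whether depth was evaluated per site or per genotype.

```python
def parse_info(info):
    """Parse a VCF INFO column ('DP=14;AF=0.5;INDEL') into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def is_snp(ref, alt):
    """True if REF and every ALT allele are single nucleotides."""
    alleles = alt.split(",")
    return len(ref) == 1 and all(len(a) == 1 and a in "ACGT" for a in alleles)


def filter_vcf_lines(lines, min_qual=30.0, min_depth=10):
    """Keep header lines plus SNP records passing quality and depth cutoffs."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        cols = line.split("\t")
        ref, alt, qual, info = cols[3], cols[4], cols[5], cols[7]
        if qual == "." or float(qual) < min_qual:
            continue
        depth = parse_info(info).get("DP")  # site-level depth (assumption)
        if depth is None or int(depth) < min_depth:
            continue
        if not is_snp(ref, alt):
            continue  # drops indels and other non-SNP records
        kept.append(line)
    return kept
```

Applying the same thresholds to every population before merging keeps the allele counts comparable across the population-specific files.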

We ran Treemix with SNPs in groups of 1000, choosing to root the tree with the Wisconsin population. In addition, for each pair of populations, we estimated the average genetic divergence by using F-statistics in VCFtools to calculate the ratio of among- to within-population genetic divergence across SNP loci. To infer patterns of demographic change in the Midwestern USA and European populations, the genome-wide allele frequency spectrum was used in dadi version 1.6.3 to infer demographic parameters under several alternative models of population history. The history of L. decemlineata as a pest is relatively well-documented. The introduction of L. decemlineata into Europe in 1914 is thought to have involved a strong bottleneck followed by rapid expansion. Similarly, an outbreak of L. decemlineata in Nebraska in 1859 is thought to have preceded population expansion into the Midwest, reaching Wisconsin in 1865. For each population, a constant-size model, a two-epoch model of instantaneous population size change at a time point τ, a bottle-growth model of instantaneous size change followed by exponential growth, and a three-epoch model with a population size change of fixed duration followed by exponential growth were fit to infer θ, the product of the ancestral effective population size and mutation rate, and relative population size changes.

Nitrate contamination of freshwater resources from agricultural regions is an environmental and human health concern worldwide. In agriculturally intensive regions, it is imperative to understand how management practices can enhance or mitigate the effect of nitrogen loading to freshwater systems. In California, managed aquifer recharge on agricultural lands is a proposed management strategy to counterbalance unsustainable groundwater pumping practices.
Agricultural managed aquifer recharge (AgMAR) is an approach in which legally and hydrologically available surface water flows are captured and used to intentionally flood croplands with the purpose of recharging underlying aquifers. AgMAR represents a shift away from the normal hydrologic regime wherein high efficiency irrigation application occurs mainly during the growing season. In contrast, AgMAR involves applying large amounts of water over a short period during the winter months. This change in winter application rates has the potential to affect the redox status of the unsaturated zone of agricultural regions, with implications for nitrogen (N) fate and transport to freshwater resources. Most modeling studies targeting agricultural N contamination of groundwater are limited to the root zone; these studies either assume that once NO3− has leached below the root zone, it behaves as a conservative tracer until it reaches the underlying groundwater, or employ first order decay coefficients to simplify N cycling reactions. However, recent laboratory and field-based investigations in agricultural systems with deep unsaturated zones have shown the potential for N cycling, in particular denitrification, well below the root zone. For example, Haijing et al. found denitrifying enzyme activity as deep as 12 meters in an agriculturally intensive region in China. Lind and Eiland reported N2O production in sediments taken from 20 meter deep cores. Other studies have reported the capability of deep vadose zone sediments to denitrify in anaerobic incubations with or without the addition of organic carbon substrates. Moreover, in agricultural settings, especially in alluvial basins with a history of agriculture such as those in California, large amounts of legacy NO3− have built up over years from fertilizer use inefficiencies and exist within the deep subsurface.