Flies were individually disrupted using a 3-mm diameter steel bead in a TissueLyser for 30 s at 30 Hz in 100 µL of 2 mg/mL Proteinase K in PK buffer before being spun down in a centrifuge for 1 min at 10,000 rpm and incubated for 2 h at 56 °C. Next, 100 µL of MagMAX DNA lysis buffer was added to each sample, followed by a 10-min incubation, before proceeding to DNA purification using a BioSprint DNA Blood Kit on a BioSprint 96 Workstation, using protocol “BS96 DNA Tissue” as per manufacturer’s instructions. Supplementary Table S1 contains all sample names, collection locations, and time of collection.

The D. suzukii mitogenome sequence and ten D. pulchrella COX2 sequences were downloaded from NCBI. The Drosophila subpulchrella mitogenome was identified by running BLAST with the D. suzukii mitogenome against the D. subpulchrella genome assembly, and annotated using MITOS2. COX2 sequences from all our D. suzukii samples were identified by aligning raw reads to the D. suzukii mitogenome, filtering out any read pairs where one of the reads was unmapped. Variants were called with Freebayes version 1.1.0 in haploid mode, and fasta sequences were extracted with bcftools consensus version 1.10.2. Publicly available COX2 sequences of D. pulchrella, D. suzukii, D. biarmipes, D. lutescens, D. mimetica, and D. melanogaster were downloaded from GenBank. All COX2 sequences were aligned with the ClustalOmega web portal, resulting in a 720 base pair alignment. Forty-seven haplotypes were identified using DNASP version 6.12.03. MEGA version 10.1.8 was used to identify the best nucleotide substitution model based on the Bayesian Information Criterion score.
While the best scoring model was the Tamura 3-parameter model + invariant sites + gamma distributed rates, we decided to use the second best scoring model T92+G, as combining +I and +G may be problematic due to correlated parameters. Using MEGA, initial trees for the heuristic search were obtained automatically by applying Neighbor-Joining and BioNJ algorithms to a matrix of pairwise distances estimated using the Tamura 3-parameter model and then selecting the topology with superior log likelihood value. Bootstrap percentages were generated from 500 replicate runs. Five categories were used in the discrete Gamma distribution to model evolutionary rate differences among sites.

Based on admixture proportion estimates and PCA, samples were grouped into the following clusters: Eastern United States, Western United States, Hawaii, Brazil, Ireland, Italy, South Korea, and Japan. One Eastern US population from Alma Research Farm, Georgia was excluded as it unexpectedly clustered with the Western US populations. To root the population trees, sequence data from two sister species of D. suzukii, D. biarmipes and D. subpulchrella, were downloaded and aligned to the D. suzukii genome. GLs and SNPs were called with ANGSD as described above and pruned by only keeping SNPs found originally in the 1-kb pruned dataset used for PCA and admixture analysis. As X-linked and autosomal SNPs may have different phylogenetic signals, X-linked SNPs were excluded. As Treemix requires genotypes to be called, PCAngsd was used to call genotypes from the GLs with a 95% posterior probability cutoff using estimated inbreeding coefficients as a prior. When looking at the distribution of the fraction of missing genotypes per site, we observed a peak at 10%, and decided to exclude any sites with greater than 20% missing data across samples, consistent with cutoffs in other studies.
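The per-site missingness cutoff described above can be illustrated with a minimal sketch. It assumes called genotypes are stored as a sites-by-samples integer matrix in which a hypothetical code of -1 marks a missing call; the function name, the encoding, and the toy data are illustrative assumptions, not the pipeline used in this study.

```python
import numpy as np

MISSING = -1  # hypothetical code for a genotype that could not be called


def filter_sites_by_missingness(genotypes, max_missing=0.20):
    """Keep sites (rows) whose fraction of missing calls is <= max_missing.

    genotypes: 2-D array, one row per site and one column per sample.
    Returns the filtered matrix and the per-site missing fraction.
    """
    genotypes = np.asarray(genotypes)
    frac_missing = (genotypes == MISSING).mean(axis=1)
    keep = frac_missing <= max_missing
    return genotypes[keep], frac_missing


# Toy data: 3 sites x 10 samples with 0, 1 and 3 missing calls respectively.
calls = np.array([
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    [0, 1, 2, 0, 1, 2, 0, 1, 2, MISSING],
    [MISSING, MISSING, MISSING, 0, 1, 2, 0, 1, 2, 0],
])
kept, frac = filter_sites_by_missingness(calls)
# Sites 0 and 1 are retained; site 2 (30% missing) is dropped.
print(frac, kept.shape)
```

Inspecting `frac` before choosing `max_missing` mirrors the approach in the text, where the cutoff was set after looking at the distribution of missing genotypes per site.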
We also excluded sites for which data are completely missing within any one cluster as required by Treemix, leaving 29,145 SNPs for analysis. Treemix version 1.13 was used to generate population admixture graphs with inferred migrations. Between 0 and 10 migrations were tested, each with 100 bootstraps calculated using a resampling block size of 500 SNPs, with global tree rearrangements and standard error estimation of migration weights enabled. The bootstrap run with maximum likelihood for each migration tested was used for plotting. To estimate support for migration edges, Treemix was also used to calculate F3 and F4 statistics using a resampling block size of 500 SNPs to estimate standard error and Z-scores. The F3 statistic tests if population A’s allele frequencies are a result of mixture of allele frequencies from populations B and C. A significantly negative value of F3 supports admixture of B or C into A. The F4 statistic measures correlations in allele frequencies between populations A and B versus populations C and D. F4 is expected to be zero under no admixture. Assuming the tree exists, a significantly positive value suggests gene flow between A and C or B and D, while a significantly negative value suggests gene flow between B and C or A and D. By setting one of these populations to be an outgroup where no admixture is expected, it is possible to infer which population pair experienced admixture. We used a Z-score cutoff of −2 or 2 to determine if a value was significantly negative or positive.

To determine if population structure exists in D. suzukii living in recently invaded locations, we sequenced wild-caught individual D. suzukii flies collected from the continental United States, Brazil, Ireland, Italy, South Korea, and China, as well as laboratory strains from Hawaii and Japan.
After aligning sequences to the reference genome, we found that average read coverage was low for some individuals and populations, with mean coverage per cluster ranging from 5-11X. As low coverage can cause biases in genotype calling, we used methods that implemented genotype likelihoods wherever possible.

We first used PCA and admixture proportion estimates to search for signs of population structure. When examining our Asian samples, we were surprised to discover that all the Namwon, South Korea samples as well as one Sancheong, South Korea sample clustered tightly with the Kunming, China population, rather than with the rest of the Sancheong samples. As several sister species to D. suzukii with similar morphological appearances occupy the same geographic ranges, we performed a phylogenetic analysis using the mitochondrial COX2 gene sequence to evaluate species identity. Based on phylogenetic inference, we determined that the Namwon, South Korea samples; Kunming, China samples; and one Sancheong, South Korea sample may actually be D. pulchrella. For this reason, these samples were excluded from further analyses. As sampling was heavily concentrated in the United States, we first conducted PCA and admixture proportion estimation on each broad geographical region separately before analyzing all populations together. Among the Eastern US samples, PCA did not separate samples by state or latitude, and no distinct populations emerged in admixture plotting at multiple clustering values. Among the Western US samples, both the first principal component and varying values of k for admixture proportions separate Hawaii from the other sample sites; however, higher values of k and higher principal components do not further partition the remaining Western US samples. Thus, it appears there is likely no strong population structure along a north to south cline in the United States.
Using a similar approach, we see that in the European samples, collections from Ireland and Italy partition as separate clusters in the first PC and when k = 2 in admixture plotting. We also observe that samples from Asia partition into Japan and South Korea, which is unsurprising as the Japanese samples originate from a laboratory population. We then used PCA to analyze all samples together to examine how differentiated invasive populations were from each other and from the ancestral Asian samples. As subtler signals can be obscured by unequal population sampling, we also analyzed a reduced dataset by sub-sampling five individuals from each region. When using all samples, the first principal component separates Eastern and Western US populations, with Asian and European samples in between. Samples from Pelotas, Rio Grande do Sul, Brazil, appear more related to Eastern US samples, although one individual clusters more with the Western US flies. We also noticed that all samples collected from the Alma Research Farm, Georgia clustered with the Western rather than Eastern US samples, despite two other Georgia sites nearby that followed the expected pattern. The second principal component then separates the European samples. When the data are sub-sampled to five individuals per cluster, the first and second components strongly separate Hawaii and Japan, respectively; this signal was likely obscured by the large number of US samples when all samples are analyzed together but is expected as these two populations were laboratory strains and have likely experienced significant genetic drift relative to wild relatives. The observations made from PCA are largely recapitulated when using sub-sampled data to estimate admixture at varying levels of k. At k = 3, we observe Japanese and Hawaiian samples form their own clusters, while all the wild collections form a third cluster.
As k is increased up to 7, we see the appearance of Europe, Brazil, Eastern United States, and South Korea samples as their own clusters, before samples from Europe are split into Ireland and Italy at k = 8. We notice increased variability in cluster assignment in the US populations, particularly when sub-sampling, which likely reflects the large sample size and high within-population diversity. However, analysis using all individuals still clearly supports Eastern and Western US samples as distinct genetic populations. In addition, we also see that the Alma Research Farm, Georgia population again clusters with the Western United States. As we were unsure if this could be the result of a very recent migration or mislabeled samples, we decided to exclude this population from further analyses. To further quantify the amount of differentiation present between regions, we estimated Fst values between regions using the 20 largest contigs, spanning all 4 chromosomes and covering 54% of the reference genome. Three general levels of differentiation were apparent based on this analysis. As expected, Fst between Hawaii or Japan and any wild population was high. Irish and Italian populations had intermediate levels of differentiation with the other wild populations and with each other, whereas Fst values between Brazil, South Korea, and both US clusters were lower. These groupings broadly match those observed from PCA.

While PCA and admixture proportion estimates were able to identify population clusters, they are unable to provide more detailed depictions of population history or migration events. To estimate the population history of these invasive populations, we used Treemix to generate a population admixture graph with inferred migration events based on covariance of allele frequencies between clusters, testing models allowing between 0 and 10 migrations.
Residuals of the model at m = 6 are within ±5 standard errors between populations, suggesting the model fits the data well, despite the variance of Hawaii with itself appearing less well modeled. The strongest signal of admixture was found in the Western United States, with an estimated Hawaiian admixture proportion of 41.0%, and was also observed in most models. To formally test for admixture, we used the F3 admixture statistic in the form F3 where popX represents any third population, and found significantly negative values for all populations, strongly supporting admixture of Hawaii into the Western United States. We also used the F4 statistic, using the form F4 such that a negative value supports “B” and “C” admixture, whereas a positive value supports “A” and “C” admixture, assuming no migration occurred between the outgroup and either A or B. Using either D. biarmipes or D. subpulchrella as the outgroup, both F4 tests were significantly positive, again supporting this admixture. Thus, the Western US population sampled is composed of nearly equal ancestry from a Hawaiian ancestor and the common ancestor of the US/Brazil populations. As Treemix assigns the edge with smaller weight to be the “migrant” edge by default, it may be unidentifiable whether the Hawaiian ancestor or the US/Brazil common ancestor should be called the migration source. We also observed two countries with US admixture in the m = 6 model. Ireland had an Eastern US admixture of 25.3%, although at varying values of “m” the source of this admixture fluctuates between the Eastern United States, Brazil, or the Eastern US/Brazil ancestor. However, in all cases, the admixture strength and significance remain consistent. While no F3 statistic support was found, both F4 statistics were significantly negative, supporting Ireland’s Eastern US/Brazilian and European ancestry.
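To make the logic of these tests concrete, the sketch below computes F3 and F4 from per-population allele frequencies and derives a Z-score with a delete-one-block jackknife over blocks of 500 consecutive SNPs. It is a simplified illustration under stated assumptions: the function names and the simulated frequencies are hypothetical, no finite-sample bias correction is applied, and Treemix's own implementation may differ in these details.

```python
import numpy as np


def f3_per_snp(a, b, c):
    """Per-SNP terms of F3(A; B, C) = E[(a - b)(a - c)]."""
    return (a - b) * (a - c)


def f4_per_snp(a, b, c, d):
    """Per-SNP terms of F4(A, B; C, D) = E[(a - b)(c - d)]."""
    return (a - b) * (c - d)


def block_jackknife(per_snp, block_size=500):
    """Return (estimate, standard error, Z) from a delete-one-block jackknife.

    SNPs are assumed to be in genomic order; trailing SNPs that do not
    fill a complete block are ignored in this sketch.
    """
    per_snp = np.asarray(per_snp, dtype=float)
    n_blocks = per_snp.size // block_size
    if n_blocks < 2:
        raise ValueError("need at least two complete blocks")
    blocks = per_snp[: n_blocks * block_size].reshape(n_blocks, block_size)
    total = blocks.sum()
    estimate = total / blocks.size
    # Statistic recomputed with each block left out in turn.
    leave_one_out = (total - blocks.sum(axis=1)) / (blocks.size - block_size)
    deviations = leave_one_out - leave_one_out.mean()
    se = np.sqrt((n_blocks - 1) / n_blocks * np.sum(deviations ** 2))
    return estimate, se, estimate / se


# Simulated frequencies (illustrative only): B, C and an outgroup O drift
# independently from a shared ancestor; A is a 50/50 mixture of B and C.
rng = np.random.default_rng(0)
n_snps = 20_000
ancestral = rng.uniform(0.1, 0.9, n_snps)


def drift(p, sd):
    return np.clip(p + rng.normal(0.0, sd, p.size), 0.0, 1.0)


pop_b = drift(ancestral, 0.05)
pop_c = drift(ancestral, 0.05)
pop_o = drift(ancestral, 0.05)
pop_a = drift(0.5 * pop_b + 0.5 * pop_c, 0.01)

f3, f3_se, f3_z = block_jackknife(f3_per_snp(pop_a, pop_b, pop_c))
f4, f4_se, f4_z = block_jackknife(f4_per_snp(pop_a, pop_b, pop_c, pop_o))
print(f"F3(A; B, C)    = {f3:.5f}, Z = {f3_z:.1f}")  # negative: A is admixed
print(f"F4(A, B; C, O) = {f4:.5f}, Z = {f4_z:.1f}")  # positive: A-C gene flow
```

With real data the frequencies would come from the called genotypes per cluster, and the |Z| > 2 cutoff from the methods would be applied to the resulting Z-scores.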