Fertility, health, education, and other outcomes of interest to demographers are the product of an individual's genetic makeup and their social environment. Yet, gene × environment (G×E) research deploys a limited toolkit on the genetic side to study the gene–environment interplay, relying on polygenic scores (PGSs) that reflect the influence of genetics on levels of an outcome. In this article, we develop a genetic summary measure better suited for G×E research: variance polygenic scores (vPGSs), which are PGSs that reflect genetic contributions to plasticity in outcomes. First, we use the UK Biobank (N ∼ 408,000 in the analytic sample) and the Health and Retirement Study (N ∼ 5,700 in the analytic sample) to compare four approaches to constructing PGSs for plasticity. The results show that widely used methods for discovering which genetic variants affect outcome variability fail to serve as distinctive new tools for G×E. Second, using the PGSs that do capture distinctive genetic contributions to plasticity, we analyze heterogeneous effects of a UK education reform on health and educational attainment. The results show the properties of a useful new tool for population scientists studying the interplay of nature and nurture and for population-based studies that are releasing PGSs to applied researchers.
Growth in the Use of Genome-Wide Measures to Study Genetic Moderation of Environments
A wide range of research has shown that fertility, educational attainment, marked disparities in disease incidence (e.g., obesity), and other outcomes of interest to demographers are influenced by both an individual's genetic makeup and their social environment. This gene–environment (G×E) research has undergone a large shift in its summarization of genetic variation.
Earlier research focused on how a single or a small set of genetic variants moderated social environments to affect outcomes. This research studied, for example, how polymorphisms in specific genes (e.g., the monoamine oxidase A gene, the promoter region of 5-hydroxytryptophan) moderate social conditions, such as stressful childhood experiences or parental abuse (e.g., Guo et al. 2008; for a review, see Seabrook and Avison 2010). Two developments led researchers to abandon this type of research. First, many single-gene G×E studies could not be replicated (Border et al. 2019; Duncan and Keller 2011; Keller 2014). Second, growing evidence suggested that most outcomes of interest to social and behavioral scientists (e.g., educational attainment, body mass index [BMI], depression) are highly polygenic—that is, the result of small contributions of many variants across the genome— rather than monogenic (Boyle et al. 2017). As a result, researchers have shifted to using polygenic scores (PGSs) that summarize genome-wide contributions. As we show later, PGSs have become the workhorse tool for social scientists studying the genetic moderation of environments. As a result, large social science cohort studies—such as the Health and Retirement Study (HRS) (Ware et al. 2018), the National Longitudinal Study of Adolescent to Adult Health (Braudt and Harris 2020), the Wisconsin Longitudinal Study, and the Fragile Families and Child Wellbeing Study—have released, or are considering releasing, PGSs alongside their standard survey measures.
The proliferation of PGSs as the workhorse tool for studying genetic moderation of social environments raises the question: What genome-wide summary should researchers use? Until now, researchers have developed scores meant to predict the conditional mean of an outcome. One problem with using these scores to study G×E is that the PGS used in the interaction term is typically constructed from a meta-analysis of levels effects across multiple cohorts that differ temporally and geographically. As a result, the PGS may be particularly ill-suited for G×E analysis because the PGS is based on the extraction of a signal for a main effect that is common across the plausible range of environments with which researchers may seek to interact it (i.e., the multiple cohorts, countries, and contexts on which it is based). By contrast, a variance polygenic score (vPGS) based explicitly on variation as the estimand in the training may capture signals of variation across environmental contexts. More broadly, the goal of the present study is to expand social scientists’ methodological toolkit by presenting a new summary measure that summarizes genetic contributions to plasticity and demonstrate its utility in gene–environment research.
Implicit Models of Genetic Moderation: Outcome Moderation Versus Variability
Past typologies of different types of gene–environment interactions have focused on differences in the shape of the interaction (e.g., Boardman et al. 2014; Derringer et al. 2019). For instance, Boardman et al. (2014) and Derringer et al. (2019) each summarized three shapes of interactions: (1) diathesis–stress, in which genetic differences are observed in challenging environments but not in nurturing ones; (2) vantage sensitivity, in which genetic differences emerge only in low-stress (i.e., positive) environments but we do not expect interactions between stressful environments and genotypes to alter outcomes; and (3) crossover or differential susceptibility, in which those with a risky genotype evince more adverse outcomes in high-stress environments but also have some of the best outcomes in supportive environments (Boyce and Ellis 2005; Ellis et al. 2011). Researchers investigating the genetic moderation of environments distinguish between these shapes through both theory and the form the interaction effect takes—for instance, differential susceptibility having a crossover shape.
Yet, shape is only one dimension on which genotypes can moderate environments. The second dimension, which occurs regardless of shape, is what form of genetic variation moderates the impact of the environment on an outcome. Here, we review two types.
Moderation Through Dimming or Amplifying
The first type of interaction, coined by Domingue et al. (2020), is moderation through an individual's genotype dimming or amplifying an environment—when a social environment impedes or removes an impediment to the expression of a genetically influenced outcome. For instance, people may vary in their genetic propensity to complete formal schooling (Lee et al. 2018). However, in certain societies, access to schooling may be limited for the entire population or a subgroup of the population (e.g., limited higher education access for women during much of the twentieth century in the United States and elsewhere). If that constraint is removed, individuals’ genetic propensities toward higher education that previously had no avenue for expression can then become manifest. In this case, we would expect a significant interaction in a model in which a person's number of years of schooling is regressed on an indicator for the cohorts impacted by education reform and a summary measure of a person's genetic propensity to complete formal schooling. The coefficient between the genetic summary measure and reform would be null or smaller in the pre-reform years; it would become significant and positive during the post-reform years. Put differently, in societies where education constraints were attenuated or absent, a PGS would poorly predict education before a schooling expansion and would better predict it (i.e., by showing increased genetic penetrance) once access to formal education increased.
As another example, economic changes in the United States removed caloric constraints for much of the population. Genetic predispositions toward higher BMI now interact with an altered food environment (Conley et al. 2016; Guo et al. 2015). Those with a genetic predisposition toward higher BMI, which stems from a genetic architecture related to appetite regulation and impulse control as well as metabolism (Locke et al. 2015), are more highly impacted by the new food environment.
Moderation Through Plasticity
The case of BMI, however, also highlights a different form of genetic variation that can interact with changes to the environment. Individuals may vary not only in their propensity toward higher or lower BMI but also in their propensity toward changes in BMI in the face of environmental changes. Some individuals have genotypes that are less buffering of environmental changes—for instance, changes in the environment between a more calorie-rich or more calorie-restrictive environment. When the environment changes in either direction, their BMI is likely to exhibit large changes. Other individuals have genotypes that are more buffering of environmental changes: when they enter a new environment that differs in available calories, their BMI is less likely to change because they adapt to that environment in ways that minimize changes, regardless of their BMI at baseline. A genetic predisposition toward higher or lower levels of BMI might be very different than a genetic predisposition toward changes in BMI in the face of shifting environmental conditions.
We refer to this type of genetic moderation as moderation through plasticity. Plasticity can take two forms. The first is variation in within-individual plasticity, which is relevant for dynamic outcomes like BMI and depression that change as individuals progress through the life course. As we discuss in the Discussion section, estimating genetic contributions to within-person variability is complicated by the lack of data with both large-scale cohorts that have been genotyped and repeated phenotypic measures across genotyped individuals. The second, and more immediately tractable, form is population-level variation in plasticity. Consider a shock that affects BMIs in a population, such as neighborhood violence that leads to more sedentary activity. Whereas one form of G×E interactions might predict that those with genetic propensities toward high BMI are most impacted by the change, a plasticity-focused interaction would identify individuals who are most sensitive to the environmental shock, regardless of their propensity toward higher or lower BMI.
Previous G×E Research Using Genome-Wide PGSs: Moderation Using Levels Scores
Moderation through dimming or amplifying refers to a differential impact of an environmental change on those with different propensities toward an outcome. Yet, in a particular context—such as changes to neighborhoods interacting with genotype to impact BMI or changes to education policy interacting with genotype to affect schooling—genetic predispositions toward greater variability may also play a role.
PGSs for outcome levels have been researchers’ workhorse tool for studying the genetic moderation of environments. The use of this tool has inadvertently narrowed the research focus to outcome moderation. Researchers use a three-step process when they develop and use these scores:
Step 1: Estimate separate linear regressions of an outcome (Y) in a large training sample, or a data set or randomly selected data subset separate from the data used for analysis, to develop weights that reflect each variant's contribution to levels of that outcome.
Step 2: Use the weights from Step 1 to construct a PGS in a separate sample.
Step 3: Interact that PGS with some measure of E to study genetic moderation of environments.
At Step 1, researchers have focused on genetic contributions to levels of an outcome rather than to variability. Table 1 lists recent G×E studies using PGSs. The table shows that the majority of these studies examined variation in the impact of environments on an outcome among people with different propensities for that same outcome. For example, Robinette et al. (2019) found that the impact of neighborhood features on type 2 diabetes had a larger impact on those with higher genetic propensities toward the disease.
With the exception of the Domingue et al. (2017) study, which examined how a genetic risk score for well-being buffers the impact of the loss of a spouse on depression, the studies listed examined moderation where the genetic moderator is the same as the outcome impacted by the gene–environment interaction. Furthermore, this examination of only one form of genetic moderation is often implicit, with the researchers stating that they are studying G×E interactions rather than identifying an explicit estimand (Lundberg et al. 2021) that refers to a particular type of G×E interaction. The failure to specify the type of moderation explicitly has led to missed opportunities for examining other forms of moderation.
vPGS as a Tool for Examining New Forms of Genetic Moderation
The implicit focus on one form of genetic moderation of social environments stems from the reliance on one tool for G×E research: PGSs used to predict levels of an outcome. We follow recent calls to expand social scientists’ toolbox for studying the genetic moderation of environments. Domingue et al. (2020) distinguished between “dimmer-type” gene–environment interactions corresponding with outcome moderation and “lens-type” gene–environment interactions taking a different form.1 They argued that social scientists often frame their G×E research aims as studying lens-type interactions, but reliance on PGSs for levels of an outcome might impede their progress:
The selection of PGS effects for examining lens-type G×E may be particularly challenging given that we construct PGSs from GWASs [genome-wide association studies] that only include main effects of SNPs [single nucleotide polymorphisms] . . . [I]f the environmental context of the participants in the GWAS sample used to construct the PGS is similar to that in the test sample used to estimate G×E, then it is unlikely to include SNPs that demonstrate lens-type patterns, as the main effects of these SNPs will be close to zero. (Domingue et al. 2020:470)
This call suggests that better tools for variability moderation or lens-type moderation are genome-wide summary measures (PGSs) constructed from weights that more closely mirror the theory behind G×E.
Here, we present one approach: constructing genome-wide summary measures from models that assess genetic contributions to variability in outcomes.2 In statistical genetics, these models are called vQTL analyses and are used to detect variance-affecting genetic loci. Researchers have developed a variety of approaches for detecting genetic contributions to variability but have used them only to find the top SNPs—the few SNPs with the lowest p values in regressions performed separately for each SNP. They have not used the weights from these models to construct genome-wide scores for plasticity.
Rönnegård and Valdar (2011) first coined the term vQTL to discuss genetic contributions to trait variability. Yang et al. (2012) made one of the earliest attempts at vQTL analysis, operationalizing variability as a person's squared z score of a trait; the person deviates from the mean of an outcome in either direction. Wang et al. (2019) and others used the classic Levene's test, which examines whether the error variance significantly differs across subgroups—in genetics, across three subgroups (AA, AB, and BB) at a given locus. However, such attempts can lead to false positives when researchers try to distinguish between variants that affect the mean of an outcome and those that affect the variance.
Two methods have been used to control for this mean–variance conflation. Conley et al. (2018) used sibling pairs to examine how variation in the sibling pair's combined count of minor alleles at a locus contributes to that sibling pair's standard deviation in the trait, controlling for the sibling pair's mean levels of a trait. Young et al. (2018:1613) decomposed trait variance into two components—an additive effect and a dispersion effect—and argued that the latter provides a measure of “when a SNP has a variance effect beyond that which can be explained by a general mean–variance relationship.”3
There is a significant gap in the use of these methods to study gene–environment interplay. Researchers have used each method to find top-hit loci (i.e., SNPs with an association with the outcome where the p value falls below a given threshold) that contribute to variability in traits such as BMI (Conley et al. 2018; Yang et al. 2012; Young et al. 2018). Young et al. (2018), Marderstein et al. (2021), and others also interacted these highly significant single SNPs one by one with measures of social environments. We are not aware of any studies that explored whether the variance weights that these methods generate can be aggregated to produce vPGSs, or genome-wide summary measures of a person's plasticity. vPGSs can expand demographers’ toolbox for studying gene–environment interplay. Our study is the first to build and characterize the properties of this new tool.
Our research addresses three questions:
What are best practices for building vPGSs?
When we build these scores, do they reflect distinctive genetic contributions to variability in a trait, or are they too correlated with scores for levels of an outcome to serve as a new tool for gene–environment research?
When the scores are applied to a real-world example (education reform in the United Kingdom), what forms of moderation are evident?
Estimating vPGS Weights in the UK Biobank
We build the two types of scores—a typical PGS (hereafter, mPGS) for levels of an outcome and a vPGS for variability in an outcome—for comparison, using the UK Biobank, a data set containing approximately 500,000 individuals from across the United Kingdom. We limit the sample to respondents who passed quality control and were of British ancestry based on information provided by the UK Biobank, leaving us with 408,219 in our final analytic sample. Further information regarding sample construction and quality control can be found in section S.1 of the online appendix.
The UK Biobank data set is large enough that we can divide the sample into training and test sets while maintaining sufficient statistical power for fitting GWAS and vQTL. We produce training and test sets by randomly sampling respondents. The training set includes 80% of the White British subsample of the UK Biobank, and the test set contains the remaining 20%.
We analyze four outcomes: height, BMI, educational attainment, and the number of children ever born (a measure of fertility). We calculate the inverse normal transformations of the outcomes. We also z-score traits to create a second set of dependent variables to be used in the squared z score analyses. Unless specified as z-scored, a trait/outcome should be assumed to be inverse normal transformed.
First, for each outcome, we run a regression predicting the inverse normal of that outcome, such that the weights reflect the contribution of each genetic locus to the mean level of the outcome. We use the software PLINK (version 1.9) for these regressions, controlling for age, sex, genotyping array, and the first 40 principal components.4 These regression weights, which we refer to as weights for levels PGSs, correspond to the traditional tools used in G×E research.
Second, an identical set of regressions predict the squared z score of the outcome, corresponding with the Yang et al. (2012) method for vQTL analysis. Again, we include age, sex, array, and the first 40 principal components as controls. Because the z score is squared, values that are the same number of standard deviations above or below the mean will receive the same value. Thus, the regression predicts distance from the mean rather than the mean level itself. However, as argued earlier, the transformed version of the outcome variable will still be correlated with the mean. We refer to the weights and PGSs produced by these regressions as squared z.
Third, we run regressions for each outcome on the sibling subsample of the UK Biobank (N = 19,294 White British sibling pairs) while controlling for the same set of covariates noted earlier. For each sibling pair, we calculate the intrasibling mean and standard deviation. We then residualize the standard deviation on the mean and use this new residualized standard deviation as our outcome variable. Because each sibling pair is represented twice in the data, we use only one member of each sibling pair in the final regressions. We refer to the weights and PGSs produced by this method as sibling standard deviation.
Fourth, we use OSCA (www.cnsgenomics.com/software/osca; Wang et al. 2019) to conduct a mean–variance vQTL analysis using Levene's test for variance heterogeneity. Instead of estimating the effect size and standard error, Levene's test assesses the equality of variances between sample groups (in this case, those that have a given allele combination and those that do not). Thus, following Zhang et al. (2019), OSCA rescales the test statistics (p value) to the effect size and standard error using z statistics. We refer to the weights and PGSs produced by this method as Levene's. OSCA requires the researcher to distinguish continuous from discrete controls. Thus, we treat the binary variables sex and array as discrete and treat age and the principal components as continuous.
Fifth, we use a mean–variance vQTL analysis using heteroskedastic linear mixed models (Young et al. 2018). This method produces additive (mean) and variance effects. It also allows us to derive dispersion effects, which are variance effects that are independent of the mean effect. We use the weights produced by these dispersion effects in subsequent analyses, referring to them using the name given by the authors—the heteroskedastic linear mixed model (HLMM). We include age, sex, array, and the first 40 principal components as both mean and variance covariates.
Finally, we ensure that results comparing the different vPGSs are due to true differences between the scores, not due to differences that arise from the smaller sample size and lower precision in the sibling-based method: for every vQTL or GWAS analysis run on the full sample, we run an analogous analysis on a randomly chosen subsample, with the number of respondents set to be equal to the number of sibling pairs in the UK Biobank.
Constructing vPGS in the HRS
Using the weights from the previous step, we construct vPGSs in the HRS. We restrict the HRS sample to self-identified European Americans who passed the HRS preprocessing procedure and were within 2 standard deviations of the mean of the first two genetic principal components of their racial/ethnic group. This leaves 10,554 respondents in the genotyping sample. We further restrict the sample to respondents with at least one wave of BMI collected (the following variable, with * indicating the wave: r*BMI). After this exclusion, 5,744 respondents remain.5
The replication code contains details on the outcome variable construction. Most notably, because the HRS is time-varying with several waves, we take the most recently observed value of the outcome for each respondent.
Relationship Between mPGS/vPGS and Levels of an Outcome
We use three tools to explore whether vPGSs can capture genetic contributions to variability in an outcome, distinct from genetic contributions to levels of an outcome. First, we estimate the following linear regression (Model 1):
where i indexes a respondent, PGS indicates the levels PGS (mPGS) or a vPGS, and Y is levels of the outcome trait (converted to the standard normal scale). Xi includes the first five principal components. Our coefficient of interest is b1. We expect that the levels PGS will significantly predict levels of a trait. We also conduct a robustness check in which, in addition to controlling for age and sex in the construction of the vPGS weights using the UK Biobank, we control for these covariates in the regression. We find no substantive differences with these additional covariates.
SNPs are of four types, categorized by whether they affect (1) neither levels of an outcome nor variance in an outcome, (2) levels of an outcome but not its variance, (3) variance in an outcome but not levels of an outcome, or (4) both variance in and levels of an outcome. For traits that are not normally distributed, isolating SNPs of the third type is more difficult because any SNP that affects the mean of an outcome will also affect the outcome's variance (Young et al. 2018). Here, we aim to construct plasticity scores that capture genetic contributions to variability. We therefore use both SNPs of type three (affecting variance but not the mean) purged of general mean–variance artifacts from nonnormal distributions and SNPs of type four. Because the distribution of type three and type four SNPs should be constant across each of the vPGSs we compare, we interpret a larger positive coefficient on the vPGS from a regression of levels of an outcome on the vPGS as evidence that the vQTL method is either picking up a large share of type four SNPs relative to type three SNPs or has weights that fail to adjust for the mechanical relationship between mean and variance. For these regressions, our samples are the HRS and the test set from the UK Biobank.
Second, we examine whether these patterns of correlation after constructing the vPGS in each sample are also present in the underlying weights that summarize each SNP's contribution. We use linkage disequilibrium score regression (Bulik-Sullivan, Loh et al. 2015) for two purposes: (1) to compare the heritability of levels of an outcome to the heritability of plasticity in that outcome measured using the four techniques discussed earlier (squared z, Levene's test, HLMM, and sibling standard deviation), and (2) to compare the underlying genetic correlations between levels and variability for each outcome as well as across outcomes in levels and variability (Bulik-Sullivan, Loh et al. 2015; for a social science application of genetic correlations, see Wedow et al. 2018). Finally, the analyses of heritability show very low heritability for vPGSs, such as HLMM and sibling standard deviation, which are less confounded with levels of an outcome. Therefore, we conduct two validation exercises to investigate whether the vPGSs are capturing some form of plasticity and to confirm that the two scores do not represent random noise. Using the HRS, which differs from the UK Biobank in having repeated measurements of the phenotype over time, we explore the relationship between each vPGS and two forms of plasticity. The first form of plasticity is within-person variability, which we measure using two versions of the within-person standard deviation in BMI: a version using raw values of BMI and a version detrended using age-specific trends. The second form of plasticity is unexplained population-level variability, operationalized by regressing BMI on age, sex, and the first five principal components and using the squared residual from that regression as the outcome.
Together, these analyses show that two of the PGSs for plasticity—one constructed using the squared z score of an outcome and the other constructed using Levene's test for variance heterogeneity—fail to summarize genetic contributors unique to variability in an outcome. However, two of the PGSs for plasticity—one summarizing dispersion effects and the other constructed from sibling variation—capture more distinctive genetic contributions. We focus on these two tools for the application discussed later, but we include results with all four scores in the online appendix.
mPGSs Versus vPGSs as Moderators of a UK Education Reform
We use these preferred plasticity scores to study heterogeneous effects of a large-scale education reform initiated in 1972 in England, Scotland, and Wales that extended the age to which students were legally required to stay in school from 15 to 16 years old (Barcellos et al. 2018). Using Barcellos et al.’s (2018) regression discontinuity design, we evaluate the extent to which the two effective vPGSs can detect different forms of genetic moderation of this educational shock compared with the standard levels PGSs. We examine two cases using the same reform as the exogenous environmental context: (1) following Barcellos et al. (2018), we examine the moderating role of genetic risk of obesity in the relationship between education and body size; and (2) we evaluate the influence of genetic plasticity on downstream educational outcomes. Both outcomes—body size and educational attainment—are affected by gene and environment interactions but may differ in the kinds of G×E effects they exhibit.
More specifically, following Barcellos et al. (2018), we use two-stage least-squares regression to instrument whether the individual stayed in school until age 16 (Educ16) based on whether they were younger than 16 when the reform was enacted (post-reform) and were therefore legally required to stay the extra year. We then interact this residualized version of Educ16 with each of the BMI PGSs to evaluate whether the interaction between educational attainment and genetic propensity is significant, given that both factors affect health outcomes. We also report a reduced-form regression in which each PGS is directly interacted with the post-reform variable.
The main health outcome is body size, a weighted combination of BMI, waist-to-hip ratio, and body fat percentage. Barcellos et al. (2020) identified the point in the body size distribution at which they should have the most power to detect an effect. We use this distributional threshold to create an above-threshold version of body size, the results for which can be compared with the continuous version of the outcome.
In a second set of analyses, we examine the role of genes in the impact of an additional year of education on downstream educational outcomes. Here, rather than instrumenting educational attainment, we look specifically at whether the effect of one's genotype on educational outcomes differs depending on whether one was born before or after the reform. This analysis is akin to the reduced-form regressions reported for BMI. Following the previous literature (Barcellos et al. 2018) finding no effect of the reform on the likelihood of attending college, we focus on outcomes for those who left school at age 18 or younger. We examine the effect of the reform on four educational outcomes used in the literature. First, we consider whether the respondent left school at age 16 or later, given that some students opted out of attending school through age 16 despite the reform. Second, we consider whether respondents achieved any certifications as a result of their education (certification). For the last two outcomes, we explore whether they achieved specific certifications: O levels or Certificate of Secondary Education (CSE; which were equivalent and later replaced by the General CSE in 1988) and A levels.
Controls, for both sets of models, include a quadratic term for the number of days that passed from when the respondent was born until the time of reform (to factor out any time trends), dummy variables for the month born, sex, age at the time of assessment (in days), age squared, dummy variables for country of birth, the first 15 principal components, the mPGS, the interaction between those principal components and Educ16 (or, in the reduced-form regressions, post-reform), and the interaction between the mPGS and Educ16. We use triangular kernel weights to assign more weight to observations closer to the reform, and we allow time trends to vary before and after the reform (Barcellos et al. 2018). As we will show, the squared z score and Levene's test plasticity scores fail to capture variability distinct from mean effects; we therefore present these results only in the online appendix (section S.8).
How Do the Plasticity Scores Relate to Levels of a Trait?
The first question that arises when using plasticity scores for G×E research is whether the plasticity score simply captures genome-wide contributions to levels of an outcome as opposed to genome-wide contributions to variability in an outcome. If the plasticity score looks very similar to social scientists’ standard tool for G×E research, it is less useful as a new tool for capturing distinctive forms of genetic moderation. Summarizing the results of Model 1, Figure 1 depicts whether the plasticity score significantly predicts levels of an outcome. The left panel shows the results in the smaller full HRS sample; the right panel displays the results in the larger sample of the UK Biobank test set. Each bar represents one of the four outcomes of interest to demographers: BMI, education, height, and the number of children ever born.
As expected, the levels PGSs predict levels of an outcome. However, in three of the four traits, the squared z vPGS significantly predicts levels of a trait. In two of the four traits, the Levene's test vPGS significantly predicts levels of a trait. In contrast, the sibling standard deviation and HLMM vPGS were significant for only one of the four traits in the HRS sample, although they were significant for more traits in the UK Biobank test set.
Overall, the results show that researchers hoping to use plasticity scores for G×E research should be careful to choose a tool that captures distinctive genetic contributions to plasticity apart from genetic contributions to an outcome's mean. Online appendix section S.3 presents additional results, including those comparing the scores’ significance when we match the sample size of the nonsibling scores to that in the sibling-based analyses.6
The Genetic Correlation Between mPGS and vPGS
The previous results show that when we aggregate weights from the different vQTL methods to produce a PGS for plasticity, one score—the squared z vPGSs—performs similarly to an mPGS in predicting levels of a trait. As a result, the score may be a less useful tool for examining certain forms of gene–environment interplay because it fails to capture distinctive genetic contributions to variability. Here, we examine whether tools aimed at using weights from mPGSs can be used to infer (1) heritability and (2) genetic correlation to examine the genetic architecture of plasticity.
Section S.5 of the online appendix contains the results from using linkage disequilibrium score regression to examine the univariate heritability of each of the four outcomes in terms of levels of an outcome (replicating previous work) and in terms of plasticity in that outcome (extending prior work). We find valid estimates of heritability only for the squared z score, possibly because the method requires weights with a certain degree of precision to generate nonzero heritability. Future research should investigate better methods for estimating heritability for less well-powered vQTL weights.
Because the squared z score was the only one with nonzero heritability across outcomes, we examine the genetic correlation between levels of each outcome (replicating past work by Bulik-Sullivan, Finucane et al. 2015), plasticity in each outcome (extending their work), and levels and plasticity. Notably, we calculate these genetic correlations before estimating the scores in a sample. Therefore, the correlations reflect a shared genetic architecture between contributors to levels of an outcome and contributors to variability in an outcome.
The top panel of Table 2 shows between-trait patterns of genetic correlation for the levels PGSs7 and for the squared z vPGSs. The analysis investigates whether there are similar patterns of cross-trait genetic correlation in variability in addition to levels. The results show that the patterns are similar except for the relationship between BMI and height. Levels of height and BMI are genetically negatively correlated: those with a genetic propensity for being taller also have a genetic propensity toward lower BMI, replicating the relationship that Bulik-Sullivan, Finucane et al. (2015) found. However, plasticity in height and BMI is positively correlated. As we discuss in the Discussion section, future research should explore this relationship, which could reflect differential sensitivity to environmental inputs to growth.
The bottom panel of the table also highlights the underlying genetic correlation between an mPGS meant to purge variance effects (the additive weights from the HLMM method) and each vPGS within a trait. The table shows patterns that are generally negative. Section S.5 of the online appendix provides a visual summary of these correlations.
Validation That the vPGS Correlates With Plasticity
The previous sections showed that the squared z score vPGS, in particular, is highly correlated with levels of an outcome and that the vPGS is the only measure to produce estimates that are precise enough to examine heritabilities and genetic correlations. Yet, the noisiness of the vPGS estimates raises the question of whether the previous results might stem from scores that reflect random noise rather than true contributions to variability.
Here, we report the results of the validation exercise. Figure 2 shows the results of relating each vPGS to within-person variability in BMI among respondents with at least three waves of BMI observations (see section S.6 of the online appendix for details of the sample construction and the full regression results). Individuals with higher vPGSs have significantly more over-time variability in BMI than individuals with lower vPGSs. Section S.6 of the online appendix discusses a validation exercise in which we regress the squared residual of BMI on each of the vPGS.
Summary: Which vPGS Can Serve as New Tools for G×E?
Taken together, our results show that the squared z vPGS is less useful as a new tool for examining the gene–environment interplay. The vPGS significantly predicts levels of an outcome across four diverse traits (height, BMI, education, and the number of children ever born). The squared z score exhibits patterns of underlying genetic correlation similar to those between levels of a trait. In contrast, the sibling standard deviation method (Conley et al. 2018) and dispersion weights (Young et al. 2018) show better properties for capturing distinctive genetic contributions to plasticity that appear to be less confounded with levels of an outcome.
Why might past research studying methods for vQTL have overlooked the ways that certain methods fail to capture distinctive genetic effects? To answer this question, section S.7 of the online appendix discusses the way researchers have commonly assessed whether a method for detecting vQTLs overlaps with a method for detecting mean effects or normal top SNPs: examining whether the two methods select similar SNPs as top hits.8 The results show that although the top-hits comparison reveals some degree of overlap—for instance, the squared z score and Levene's top hits display more overlap with the levels top hits than the other methods—this comparison might be too conservative. In particular, an mPGS and a vPGS might not have overlap in SNPs with p values below a threshold, but the two might have overlap in SNPs with nonzero weights that contribute to the final scores. Overall, the results show that social scientists interested in using the vPGS as a new tool should look carefully at whether it is distinctive from, or nearly identical to, the mPGS for that outcome.
Using the vPGS to Examine Heterogeneous Impacts of Education Reform
Having examined the properties of the different PGSs for plasticity, their distinctiveness from the standard tool for G×E research (mPGS), and their relationships to one another, we can use them to adjudicate between different mechanisms of genetic moderation. Specifically, we have argued that interactions between mPGSs and the environment in predicting an outcome capture outcome moderation, whereas vPGSs can capture a different form of heterogeneous effects.
With this in mind, we turn to a practical example using mPGSs and different vPGSs to explore which kind of genetic moderation is at play. We build on the research of Barcellos et al. (2018), who investigated the impact of an educational reform raising the required age of schooling from 15 to 16 years in England, Scotland, and Wales. Unlike measures such as an individual's educational attainment or their parent's educational attainment, which can lead to false-positive gene–environment interactions through confounding between the environmental shock and parent genotype (see Conley 2016), the reform's timing is exogenous to genotype. The reform allows us to study the different forms of genetic moderation and the performance of different vPGS measures in an applied example.
We evaluate two sets of outcomes. First, following Barcellos et al. (2018), we evaluate whether there was genetic moderation of the reform's impact on health outcomes as measured by body size. If this moderation takes the form of outcome moderation, then the mPGS for BMI would significantly interact with the reform; the reform might have a larger impact on those with a low genetic propensity or a high genetic propensity toward obesity (amplifying their advantage or buffering their risk, respectively). Alternately, if this moderation takes the form of variability moderation (a significant interaction between the vPGS and the post-reform indicator), then the reform has larger impacts on those who, across many shocks, experience more swings in BMI.
Second, extending the work of Barcellos et al. (2018), we evaluate whether there was genetic moderation of the reform's impact on educational outcomes. A larger impact of the reform on those with especially high or low genetic propensities toward educational attainment would indicate outcome moderation. Under the education as the great equalizer hypothesis (Barcellos et al. 2020), we might expect that those with the lowest educational PGS are the most impacted by the extra year of mandatory education. Heterogeneous effects of the reform on individuals with different underlying genetic plasticity would indicate variability moderation.9
Genetic Moderation of the Education Reform's Impact on Body Size
For the body size models, presented in Tables 3 and 4, the interaction between the mPGS and being exposed to the reform is the only statistically significant result. As Table 3 shows, only one of the PGSs for plasticity shows a significant interaction: the squared z score vPGS (see section S.8 of the online appendix, which uncovers a marginally significant result on the above-threshold outcome). This result is likely due to the high correlation between the squared z score vPGS and the mPGS.
These results suggest that outcome moderation (rather than plasticity) is the main form of genetic moderation of the education reform's effect on these measures of health. As shown in Figure 3, the reform has larger impacts on reducing obesity-related measures among those with higher genetic propensities toward obesity. Largely replicating those of Barcellos et al. (2018), our results show that the mPGS those authors used corresponds to the type of genetic moderation that unfolded.
Genetic Moderation of the Education Reform's Impact on Educational Attainment
Next, we examine the reform's impact on educational attainment. We find a significant interaction between the HLMM PGS for plasticity and post-reform when predicting three of the four educational outcomes: left school at age 16 or later (Educ16), certification, and O levels or CSE. The interactions between the HLMM plasticity score and post-reform remain significant when controls are included. By contrast, we find significant interactions between the relevant mPGS and post-reform for only one of the four outcomes: left school at age 16 or later. The results are presented in Table 5 and Figure 4, which compare the predicted educational outcomes for children in the lower, middle, and upper terciles of the HLMM vPGS distribution before and after the reform. For the significant interactions, the results show that those with higher HLMM PGSs attain lower levels of educational outcomes before the reform but equal levels of educational outcomes after the reform, potentially because they had enhanced sensitivity to the positive effects of the reform. Because results presented earlier show that the HLMM plasticity scores capture genetic contributions to plasticity in outcomes distinct from genetic contributions to the conditional mean, we are more confident that the effect is a true positive association.
Recognizing the biosocial nature of most outcomes of interest to demographers, social scientists are increasingly interested in how genetic variation moderates the impact of life course events that range from society-wide education reforms to targeted policy interventions aimed at specific subgroups. That is, in addition to estimating direct main effects of genotypes and environments, social and behavioral scientists often seek to model the mutual dependence of nature and nurture. Numerous metaphors have been offered for this causal model of human traits—for example, genetics as a lens (Domingue et al. 2020) or genetics as a prism refracting environmental influences into heterogeneous treatment effects (Conley and Fletcher 2018).
Summary of Contributions
In this study, we argue that social scientists’ workhorse measure of genotypes in G×E research—a genetic summary measure that reflects genetic contributions to levels of an outcome—commits those researchers to an implicit model of the genetic moderation of environments. The model corresponds to what we call outcome moderation and what others have recently called dimmer-type moderation (Domingue et al. 2020). Although this model may characterize some forms of gene–environment interplay, there are likely other forms that cannot be captured by summary measures constructed from aggregating effects on an outcome's mean.
We propose the use of PGSs for plasticity as an addition to social scientists’ toolbox. Focused on best practices, we show how conflation between genetic effects on an outcome's mean and effects on that outcome's variance begin with SNP-level analyses but then appear in the constructed scores. The conflation also makes it difficult to investigate whether plasticity in outcomes such as BMI displays different patterns of heritability than levels of those outcomes, although an initial analysis of genetic correlations shows an interesting flip, with levels of BMI and height negatively genetically correlated but plasticity in the two positively correlated. As a whole, we argue that researchers interested in a PGS for plasticity as a distinctive summary measure of genotype should be careful to construct scores based on weights from methods that try to adjust for false-positive effects on the mean.
Then, applying the scores to a real-world application, we show how adding an E × vPGS analysis to an E × mPGS analysis can detect a particular type of G×E interaction that deploying only an mPGS would obscure. Building on Barcellos et al.’s (2018) study of the effects of an education reform on health, we show that in line with their results but contrary to our priors, outcome moderation best characterizes the reform's impact on health outcomes. However, genetic plasticity might better explain the reform's closing of gaps in educational attainment between low- and high-plasticity youth. These results show that researchers cannot know in advance with great certainty which form of moderation will be operative and thus should test for both forms.
Limitations and Directions for Future Research
The first limitation is that our application of the PGSs for plasticity is limited to one E: an education reform in the United Kingdom. Different environmental treatments could be more or less likely to exhibit moderation by a person's plasticity. Because of the issues others have raised about false positives that occur when researchers think they are detecting G×E effects but instead are detecting unobserved confounding between genotype and environment (Conley 2016; Domingue et al. 2020), we prioritize studying the effect of an E that is clearly causally identified over examining how multiple, potentially confounded Es interact with each focal vPGS. Future research leveraging other natural experiments that alter environments should incorporate plasticity scores to investigate their relevance for other contexts.
Second, genetic contributions to plasticity can be thought of in different ways. For example, within-individual plasticity would require an estimation strategy that trains the input weights to the PGS on repeated measures of the same outcome within an individual (e.g., variation in BMI across many years). This form of plasticity is a promising avenue for future research. Predicting within-person (or within-family) variation over time without having to know explicitly what the fluctuating environmental factors are may prove to be a useful exploratory exercise before researchers try to hypothesize about specific factors in the environment that may be causing the fluctuation in genotypically plastic individuals. Moreover, a within-person variability score could be used to identify in advance those individuals who might be responsive to an intervention—for example, a drug trial or an educational intervention. The advantages of identifying such individuals include increased statistical power for the identification of effects in a pilot study before investing in a larger, more costly study. Unfortunately, data sources, such as the UK Biobank, that contain a sample large enough to estimate new weights for PGSs lack large-scale repeated measures of the same individuals. Once these data sources become available, future research can construct scores better designed for this form of plasticity.
Third, we might imagine two forms of plasticity. One form of plasticity is trait specific and occurs in response to various environmental triggers. For example, an individual with high BMI plasticity might have more BMI variability in response to many different environmental shocks (e.g., education reform, changes in food landscape, changes in peer group eating behavior). Another form of plasticity may be both trait specific and environment specific. Rather than having generally high BMI plasticity, an individual might have high BMI plasticity in response to a certain type of environmental trigger. The present approach to estimating variance-affecting SNP weights and constructing cross-environmental vPGSs captures the first type of plasticity but fails to capture the second. For the second type of plasticity, researchers in statistical genetics use flexible machine-learning methods to focus on a specific E (i.e., environmental shock), interact that E with many SNPs, and use regularization and other methods to find top-performing SNP × E interactions.10 The weights from those methods focused on interactions between a specific E and each SNP could be used to estimate a vPGS specific to certain environmental triggers.
Fourth, the present paper focuses on scores that can be interacted with a specific measure of environment. However, twin- and other pedigree-based approaches to estimating heritability have long been deployed to estimate G×E—for example, by assessing whether the additive heritability estimate changes in the face of differing social conditions, such as social class background or birth cohort (e.g., Boardman et al. 2010; Vink and Boomsma 2011). Newer methods, such as genome-based restricted maximum likelihood methods, allow for similar analyses under the same general framework in which a shift in the additive (SNP) heritability is evidence of G×E (e.g., Rimfeld et al. 2018). However, because these methods take an approach of variance decomposition, they cannot distinguish between whether the differences in heritability result from a change in the genetic variance or a corresponding shift in the environmental variance. More importantly, by testing for changes in the total, additive heritability—as in G×E studies using PGSs based on levels regressions—these approaches may be overlooking important genetics × environment interactions that do not result from differences in the predictive power of levels’ effects. One way to think about the plasticity or variance effect as it interacts with the environment is as an environmental (and genetic) epistasis term. That is, it is a nonadditive effect that is not captured in traditional models. The goal of using vQTL methods is to capture this nonadditive effect.
Finally, there is growing attention to how standard PGSs (mPGS) do not represent “pure” genetic measures of propensities; instead, the weights reflect a combination of direct genetic effects and biases from population stratification and genetic nurture (Kong et al. 2018; Trejo and Domingue 2019; Zaidi and Mathieson 2020). Because of these issues, the ideal design for generating vQTL weights is having the genotypes of two or more siblings along with the parent genotype or having the genotype of three or more siblings and being able to add a fixed effect for the sibling pair. However, these methods require sufficiently large samples with one of those family-based structures. In the present article, the sibling standard deviation method provides one approach to addressing bias in the vQTL weights, but future research should explore changes in vQTL weights generated using a family-based design.
In sum, this work aims to equip demographers and social scientists with an additional tool for studying the interplay between genes and environment—one that captures a broader range of how these interactions play out in applied settings. Our article suggests that as cohort studies make PGSs available to applied researchers, they should complement the mPGS scores that they are currently releasing with scores aimed at capturing genetic contributions to plasticity.
The authors thank members of the Conley Biosociology Lab and the University of Wisconsin Social Genomics workshop for helpful feedback on the project. Results from this research were presented at the National Institute on Aging–supported 2018 Integrating Genetics and the Social Sciences conference (R13-AG062366) at the University of Colorado Boulder.
As Domingue et al. (2020:469) stated, “When considering lenses, the relative effect of a given genotype may be positive for a ‘low’ level of the relevant environmental exposure and negative for ‘high’ levels of the exposure, or vice versa.”
This approach complements Boardman et al.’s (2014) study of the genetic moderation of specific environmental shocks. They used a genome-wide gene-by-environment interaction approach, which regresses the level of an outcome (in their case, BMI) on each SNP’s interaction with an environmental moderator (in their study, education). They noted the approach’s promise for capturing GxE but also its challenges with statistical power.
We do not include two other methods in our review because of their similarity to the four we include: (1) the new deviation regression model (Marderstein et al. 2021), which models the absolute difference between an individual’s phenotype value and the phenotype medians within each genotype; and (2) the double generalized linear model (Ronnegard and Valdar 2011).
Because of multicollinearity issues, we excluded higher order age variables and terms with interactions (age squared, age squared × sex), which have been used in other studies (e.g., Young et al. 2018).
We used this approach instead of imputation because missingness was in the outcome variable rather than in a predictor.
This robustness check helps guard against a finding that the sibling standard deviation score does not significantly predict levels of an outcome while the nonsibling scores do as a result of inadequate power for the sibling score compared with the scores estimated for a larger sample. The fact that the patterns hold for the matched sample size supports our claim that the squared z and Levene’s test scores are less useful as distinctive tools.
These findings replicate results from Bulik-Sullivan, Finucane et al. (2015) for overlapping outcomes and extend their analysis to look at additional outcomes, such as the number of children ever born.
Researchers have often validated their proposed methods both by finding top hits and with simulations comparing the methods, but those simulations also largely focus on one or two top causal SNPs.
Here, regressions that use each vPGS also control for the relevant mPGS to ensure that the observed effects do not simply reflect outcome moderation. To clarify how the inclusion of mPGSs affects these results, we report the results regressions in which an mPGS is not controlled for in section S.8 of the online appendix. There, we also report the full regression results for the G×E analyses reported in the main text.
For an early application of the third type of research and a discussion of the associated challenges, see Boardman et al. (2014). Also see Frost et al. (2016), who used elastic net penalized regression to zero out many of the SNP × E interactions.