Genome-wide Association Study Regression Tests¶
Glow contains functions for performing simple regression analyses used in genome-wide association studies (GWAS).
Tip
Glow automatically converts literal one-dimensional and two-dimensional numpy ndarrays of doubles to array<double> and spark.ml DenseMatrix respectively.
Linear regression¶
linear_regression_gwas performs a linear regression association test optimized for performance in a GWAS setting.
Example¶
import glow
from glow.wgr.functions import reshape_for_gwas
import numpy as np
import pandas as pd
from pyspark.ml.linalg import DenseMatrix
from pyspark.sql import Row
from pyspark.sql.functions import col, lit
# Read in VCF file
variants = spark.read.format('vcf').load(genotypes_vcf)
# genotype_states returns the number of alt alleles for each sample
# mean_substitute replaces any missing genotype states with the mean of the non-missing states
genotypes = glow.transform('split_multiallelics', variants) \
    .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes')))) \
    .cache()
# Read covariates from a CSV file and add an intercept
covariates = pd.read_csv(covariates_csv, index_col=0)
covariates['intercept'] = 1.
# Read phenotypes from a CSV file
pd_phenotypes = pd.read_csv(continuous_phenotypes_csv, index_col=0)
phenotypes = reshape_for_gwas(spark, pd_phenotypes)
# Run linear regression test
lin_reg_df = genotypes.crossJoin(phenotypes).select(
    'contigName',
    'start',
    'names',
    'label',
    glow.expand_struct(glow.linear_regression_gwas(
        col('gt'),
        phenotypes.values,
        lit(covariates.to_numpy())
    ))
)
Parameters¶
Name | Type | Details
---|---|---
genotypes | array<double> | A numeric representation of the genotype for each sample at a given site, for example the result of the genotype_states function.
covariates | spark.ml Matrix | A matrix containing the covariates to use in the linear regression model. Each row in the matrix represents observations for a sample. The indexing must match that of the genotypes array.
phenotypes | array<double> | A numeric representation of the phenotype for each sample. This parameter may vary for each row in the dataset. The indexing of this array must match the genotypes and covariates parameters.
Return¶
The function returns a struct with the following fields. The computation of each value matches the lm function in R.
Name | Type | Details
---|---|---
beta | double | The fit effect coefficient of the genotypes parameter.
standardError | double | The standard error of beta.
pValue | double | The P-value of the t-statistic for beta.
Implementation details¶
The linear regression model is fit using the QR decomposition. For performance, the QR decomposition of the covariate matrix is computed once and reused for each (genotypes, phenotypes) pair.
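As a rough illustration of why reusing the decomposition helps, the NumPy sketch below factors the covariate matrix once and then residualizes every genotype vector against the same Q factor. This is one standard way to exploit a shared QR factorization and is not taken from Glow's source; the function name linear_gwas_sketch and its return layout are hypothetical. It returns the effect estimate, its standard error, and the t-statistic; turning the t-statistic into a P-value would additionally need a Student's t distribution function.

```python
import numpy as np

def linear_gwas_sketch(genotype_rows, phenotype, covariates):
    """Hypothetical helper: test each genotype vector against one phenotype."""
    n, k = covariates.shape
    # Thin QR of the covariate matrix: computed once, reused for every variant.
    q, _ = np.linalg.qr(covariates)
    # Remove the part of the phenotype explained by the covariates.
    y_res = phenotype - q @ (q.T @ phenotype)
    dof = n - k - 1  # residual degrees of freedom of the full model
    results = []
    for g in genotype_rows:
        # Residualize the genotype vector with the same Q factor.
        g_res = g - q @ (q.T @ g)
        gg = g_res @ g_res
        beta = (g_res @ y_res) / gg
        resid = y_res - beta * g_res
        std_err = np.sqrt((resid @ resid) / dof / gg)
        results.append((beta, std_err, beta / std_err))
    return results
```

The estimate and standard error obtained this way agree with fitting the full model on the covariates plus the genotype column, so only the inexpensive per-variant projections are repeated.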
Logistic regression¶
logistic_regression_gwas performs a logistic regression hypothesis test optimized for performance in a GWAS setting.
Example¶
# Read a single phenotype from a CSV file
trait = 'Binary_Trait_1'
phenotype = np.hstack(pd.read_csv(binary_phenotypes_csv, index_col=0)[[trait]].to_numpy()).astype('double')
# Likelihood ratio test
lrt_log_reg_df = genotypes.select(
    'contigName',
    'start',
    'names',
    glow.expand_struct(glow.logistic_regression_gwas(
        col('gt'),
        lit(phenotype),
        lit(covariates.to_numpy()),
        'LRT'
    ))
)
# Firth test
firth_log_reg_df = genotypes.select(
    'contigName',
    'start',
    'names',
    glow.expand_struct(glow.logistic_regression_gwas(
        col('gt'),
        lit(phenotype),
        lit(covariates.to_numpy()),
        'Firth'
    ))
)
# Logistic regression with offset
# Read offset from a CSV file
offset = np.hstack(pd.read_csv(offset_csv, index_col=0)[[trait]].to_numpy()).astype('double')
# Likelihood ratio test with offset
lrt_log_reg_with_offset_df = genotypes.select(
    'contigName',
    'start',
    'names',
    glow.expand_struct(glow.logistic_regression_gwas(
        col('gt'),
        lit(phenotype),
        lit(covariates.to_numpy()),
        'LRT',
        lit(offset)
    ))
)
# Firth test with offset
firth_log_reg_with_offset_df = genotypes.select(
    'contigName',
    'start',
    'names',
    glow.expand_struct(glow.logistic_regression_gwas(
        col('gt'),
        lit(phenotype),
        lit(covariates.to_numpy()),
        'Firth',
        lit(offset)
    ))
)
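The np.hstack call in the example above is there to turn a single-column table into a flat vector before it is wrapped in lit. The small NumPy-only sketch below shows the shapes involved; the sample values are made up for illustration and stand in for what DataFrame[[trait]].to_numpy() would return.

```python
import numpy as np

# Made-up single-column data: one row per sample, shape (4, 1).
column = np.array([[1], [0], [1], [1]])

# hstack concatenates the length-1 rows into one flat vector of shape (4,).
phenotype = np.hstack(column).astype('double')

# Made-up covariates: one row per sample, one column per covariate, shape (4, 2).
covariate_matrix = np.array([[34.0, 1.0],
                             [51.0, 1.0],
                             [29.0, 1.0],
                             [62.0, 1.0]])

print(phenotype.shape, phenotype.dtype, covariate_matrix.shape)
```

A one-dimensional array of doubles such as phenotype is the shape that maps to array<double>, while a two-dimensional array such as covariate_matrix maps to a matrix, with rows in the same sample order as the phenotype vector.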
Parameters¶
The parameters for the logistic regression test are largely the same as those for linear regression. The primary differences are that each phenotypes value must be either 0 or 1, and that there is one additional parameter, test, to specify the hypothesis test method. An optional offset parameter is also accepted.
Name | Type | Details
---|---|---
genotypes | array<double> | A numeric representation of the genotype for each sample at a given site, for example the result of the genotype_states function.
covariates | spark.ml Matrix | A matrix containing the covariates to use in the logistic regression model. Each row in the matrix represents observations for a sample. The indexing must match that of the genotypes array.
phenotypes | array<double> | A numeric representation of the phenotype for each sample. This parameter may vary for each row in the dataset. The indexing of this array must match the genotypes and covariates parameters.
test | string | The hypothesis test method to use. Currently likelihood ratio (LRT) and Firth (Firth) tests are supported.
offset | array<double> | An offset vector to be added to the linear predictor terms. The offset term is added with the fixed coefficient 1 (its coefficient will not be estimated). This parameter may vary for each row in the dataset. The indexing of this array must match the genotypes and covariates parameters.
Tip
The offset parameter is especially useful in incorporating the results of GloWGR with binary phenotypes in GWAS. Please refer to GloWGR: Whole Genome Regression for details and an example notebook.
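To make the "fixed coefficient 1" wording concrete, here is a minimal sketch of a logistic mean function with an offset. The function name case_probability and its arguments are hypothetical and only illustrate the standard definition of an offset; they are not part of Glow's API.

```python
import math

def case_probability(beta, genotype, covariate_terms, offset=0.0):
    """Hypothetical helper: P(phenotype = 1) under a logistic model with an offset."""
    # The offset enters the linear predictor as-is; no coefficient multiplies it.
    linear_predictor = beta * genotype + covariate_terms + offset
    return 1.0 / (1.0 + math.exp(-linear_predictor))
```

Because the offset is never rescaled during fitting, a per-sample prediction computed elsewhere can shift each sample's baseline log-odds while only the remaining coefficients are estimated.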
Return¶
The function returns a struct with the following fields. The computation of each value matches the glm function in R for the likelihood ratio test and the logistf R package for the Firth test.
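Name | Type | Details
---|---|---
beta | double | Log-odds associated with the genotypes parameter.
oddsRatio | double | Odds ratio associated with the genotypes parameter.
waldConfidenceInterval | array<double> | Wald 95% confidence interval of the odds ratio.
pValue | double | p-value for the specified test.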
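For readers unfamiliar with how log-odds, odds ratios, and Wald intervals relate, the sketch below applies the textbook formulas. It is an illustration only: the function name is hypothetical, and std_err (the standard error of the log-odds) is an assumed input that is not among the fields listed above.

```python
import math

def odds_ratio_with_wald_ci(beta, std_err, z=1.959964):
    """Hypothetical helper: odds ratio and 95% Wald interval from a log-odds estimate."""
    odds_ratio = math.exp(beta)
    # The interval is built on the log-odds scale, then exponentiated.
    lower = math.exp(beta - z * std_err)
    upper = math.exp(beta + z * std_err)
    return odds_ratio, (lower, upper)
```

The interval is symmetric around beta on the log-odds scale, so it is asymmetric around the odds ratio itself.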
Implementation details¶
The logistic regression null model and fully-specified model are fit using Newton iterations. For performance, the null model is computed once for each phenotype and used as a prior for each (genotypes, phenotypes) pair.
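The following NumPy sketch shows the general shape of such a procedure for the likelihood ratio test. It rests on two assumptions: "used as a prior" is read here as using the null-model coefficients as the starting point of the full-model iterations, and all function names are hypothetical. It is not Glow's implementation and omits safeguards such as step halving or handling of separated data.

```python
import math
import numpy as np

def fit_logistic(x, y, start, offset=None, max_iter=25, tol=1e-10):
    """Hypothetical helper: Newton iterations for logistic regression."""
    offset = np.zeros(len(y)) if offset is None else offset
    coef = np.array(start, dtype=float)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(x @ coef + offset)))
        gradient = x.T @ (y - p)
        hessian = x.T @ (x * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(x @ coef + offset)))
    log_lik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return coef, log_lik

def lrt_sketch(genotype_rows, phenotype, covariates, offset=None):
    """Hypothetical helper: likelihood ratio test for each genotype vector."""
    k = covariates.shape[1]
    # Null model (covariates only): fit once per phenotype.
    null_coef, null_ll = fit_logistic(covariates, phenotype, np.zeros(k), offset)
    results = []
    for g in genotype_rows:
        full_x = np.column_stack([covariates, g])
        # Warm start: null coefficients for the covariates, 0 for the genotype.
        start = np.append(null_coef, 0.0)
        full_coef, full_ll = fit_logistic(full_x, phenotype, start, offset)
        stat = 2.0 * (full_ll - null_ll)
        # Chi-square upper tail probability with 1 degree of freedom.
        p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
        results.append((full_coef[-1], p_value))
    return results
```

Starting each full-model fit from the null solution means most variants, whose effects are close to zero, need only a few iterations to converge.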