Genome-wide Association Study Regression Tests¶

Glow contains functions for performing simple regression analyses used in genome-wide association studies (GWAS).

Tip

Glow automatically converts literal one-dimensional and two-dimensional numpy ndarray s of double s to array<double> and spark.ml DenseMatrix respectively.

Linear regression¶

linear_regression_gwas performs a linear regression association test optimized for performance in a GWAS setting.

Example¶

from glow.wgr.functions import reshape_for_gwas
import numpy as np
import pandas as pd
from pyspark.ml.linalg import DenseMatrix
from pyspark.sql import Row
from pyspark.sql.functions import col, lit

# Read in VCF file
variants = spark.read.format('vcf').load(genotypes_vcf)

# genotype_states returns the number of alt alleles for each sample
# mean_substitute replaces any missing genotype states with the mean of the non-missing states
genotypes = glow.transform('split_multiallelics', variants) \
  .withColumn('gt', glow.mean_substitute(glow.genotype_states(col('genotypes')))) \
  .cache()

# Read covariates from a CSV file and add an intercept
covariates = pd.read_csv(covariates_csv, index_col=0)
covariates['intercept'] = 1.

# Read phenotypes from a CSV file
pd_phenotypes = pd.read_csv(continuous_phenotypes_csv, index_col=0)
phenotypes = reshape_for_gwas(spark, pd_phenotypes)

# Run linear regression test
lin_reg_df = genotypes.crossJoin(phenotypes).select(
  'contigName',
  'start',
  'names',
  'label',
  glow.expand_struct(glow.linear_regression_gwas(
    col('gt'),
    phenotypes.values,
    lit(covariates.to_numpy())
  ))
)

Parameters¶

Name	Type	Details
`genotypes`	`array<double>` (or numeric type that can be cast to `double`)	A numeric representation of the genotype for each sample at a given site, for example the result of the `genotype_states` function. This parameter can vary for each row in the dataset.
`covariates`	`spark.ml` `Matrix`	A matrix containing the covariates to use in the linear regression model. Each row in the matrix represents observations for a sample. The indexing must match that of the `genotypes` array that is, the 0th row in the covariate matrix should correspond to the same sample as the 0th element in the `genotypes` array. This matrix must be constant for each row in the dataset. If desired, you must explicitly include an intercept covariate in this matrix.
`phenotypes`	`array<double>` (or numeric type that can be cast to `double`)	A numeric representation of the phenotype for each sample. This parameter may vary for each row in the dataset. The indexing of this array must match the `genotypes` and `covariates` parameters.

Return¶

The function returns a struct with the following fields. The computation of each value matches the lm R package.

Name	Type	Details
`beta`	`double`	The fit effect coefficient of the `genotypes` parameter.
`standardError`	`double`	The standard error of `beta`.
`pValue`	`double`	The P-value of the t-statistic for `beta`.

Implementation details¶

The linear regression model is fit using the QR decomposition. For performance, the QR decomposition of the covariate matrix is computed once and reused for each (genotypes, phenotypes) pair.

Logistic regression¶

logistic_regression_gwas performs a logistic regression hypothesis test optimized for performance in a GWAS setting.

Example¶

# Read a single phenotype from a CSV file
trait = 'Binary_Trait_1'
phenotype = np.hstack(pd.read_csv(binary_phenotypes_csv, index_col=0)[[trait]].to_numpy()).astype('double')

# Likelihood ratio test
lrt_log_reg_df = genotypes.select(
  'contigName',
  'start',
  'names',
  glow.expand_struct(glow.logistic_regression_gwas(
    col('gt'),
    lit(phenotype),
    lit(covariates.to_numpy()),
    'LRT'
  ))
)

# Firth test
firth_log_reg_df = genotypes.select(
  'contigName',
  'start',
  'names',
  glow.expand_struct(glow.logistic_regression_gwas(
    col('gt'),
    lit(phenotype),
    lit(covariates.to_numpy()),
    'Firth'
  ))
)

# Logistic regression with offset

# Read offset from a csv file
offset = np.hstack(pd.read_csv(offset_csv, index_col=0)[[trait]].to_numpy()).astype('double')

# LRT test with offset
lrt_log_reg_with_offset_df = genotypes.select(
  'contigName',
  'start',
  'names',
  glow.expand_struct(glow.logistic_regression_gwas(
    col('gt'),
    lit(phenotype),
    lit(covariates.to_numpy()),
    'LRT',
    lit(offset)
  ))
)

# Firth test with offset
firth_log_reg_with_offset_df = genotypes.select(
  'contigName',
  'start',
  'names',
  glow.expand_struct(glow.logistic_regression_gwas(
    col('gt'),
    lit(phenotype),
    lit(covariates.to_numpy()),
    'Firth',
    lit(offset)
  ))
)

Parameters¶

The parameters for the logistic regression test are largely the same as those for linear regression. The primary differences are that the phenotypes values should be in the set [0,1] and that there is one additional parameter test to specify the hypothesis test method.

Name	Type	Details
`genotypes`	`array<double>` (or numeric type that can be cast to `double`)	A numeric representation of the genotype for each sample at a given site, for example the result of the `genotype_states` function. This parameter can vary for each row in the dataset.
`covariates`	`spark.ml` `Matrix`	A matrix containing the covariates to use in the logistic regression model. Each row in the matrix represents observations for a sample. The indexing must match that of the `genotypes` array that is, the 0th row in the covariate matrix should correspond to the same sample as the 0th element in the `genotypes` array. This matrix must be constant for each row in the dataset. If desired, you must explicitly include an intercept covariate in this matrix.
`phenotypes`	`array<double>` (or numeric type that can be cast to `double`)	A numeric representation of the phenotype for each sample. This parameter may vary for each row in the dataset. The indexing of this array must match the `genotypes` and `covariates` parameters.
`test`	`string`	The hypothesis test method to use. Currently likelihood ratio (`LRT`) and Firth (`Firth`) tests are supported.
`offset` (optional)	`array<double>` (or numeric type that can be cast to `double`)	An offset vector to be added to the linear predictor terms. The offset term is added with the fixed coefficient 1 (its coefficient will not be estimated). This parameter may vary for each row in the dataset. The indexing of this array must match the `phenotypes`.

Tip

The offset parameter is especially useful in incorporating the results of GloWGR with binary phenotypes in GWAS. Please refer to GloWGR: Whole Genome Regression for details and example notebook.

Return¶

The function returns a struct with the following fields. The computation of each value matches the glm R package for the likelihood ratio test and the logistf R package for the Firth test.

Name	Type	Details
`beta`	`double`	Log-odds associated with the `genotypes` parameter, `NaN` if the fit failed.
`oddsRatio`	`double`	Odds ratio associated with the `genotypes` parameter, `NaN` if the fit failed..
`waldConfidenceInterval`	`array<double>`	Wald 95% confidence interval of the odds ratio, `NaN` s if the fit failed.
`pValue`	`double`	p-value for the specified `test`. For the Firth test, this value is computed using the profile likelihood method. `NaN` if the fit failed.

Implementation details¶

The logistic regression null model and fully-specified model are fit using Newton iterations. For performance, the null model is computed once for each phenotype and used as a prior for each (genotypes, phenotypes) pair.

Example notebook and blog post¶

A detailed example and explanation of a GWAS workflow is available here.

GWAS notebook¶

How to run a notebook Get notebook link