Merging Variant Datasets¶
You can use Glow and Spark to merge genomic variant datasets from non-overlapping sample sets into a multi-sample dataset. In these examples, we will read from VCF files, but the same logic works on DataFrames backed by other file formats.
First, read the VCF files into a single Spark DataFrame:
from pyspark.sql.functions import *
df = spark.read.format('vcf').load([path1, path2])
# Alternatively, you can use the "union" DataFrame method if the VCF files have the same schema
df1 = spark.read.format('vcf').load(path1)
df2 = spark.read.format('vcf').load(path2)
df = df1.union(df2)
The resulting DataFrame contains all records from the VCFs you want to merge, but the genotypes from different samples at the same site have not been combined. You can use an aggregation to combine the genotype arrays.
from pyspark.sql.functions import *
merged_df = df.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
.agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'))
Important
When reading VCF files for a merge operation, sampleId
must be the first field in the
genotype struct. This is the default Glow schema.
The genotypes from different samples now appear in the same genotypes array.
Note
If the VCFs you are merging contain different sites, elements will be missing from the genotypes
array after aggregation. Glow automatically fills in missing genotypes when writing to
bigvcf
, so an exported VCF will still contain all samples.
Aggregating INFO fields¶
To preserve INFO fields in a merge, you can use the aggregation functions in Spark. For instance, to
emit an INFO_DP
column that is the sum of the INFO_DP
columns across all samples:
from pyspark.sql.functions import *
merged_df = df.groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles')\
.agg(sort_array(flatten(collect_list('genotypes'))).alias('genotypes'),
sum('INFO_DP').alias('INFO_DP'))
Joint genotyping¶
The merge logic in this document allows you to quickly aggregate genotyping array data or single sample VCFs. For a more sophisticated aggregation that unifies alleles at overlapping sites and uses cohort-level statistics to refine genotype calls, we recommend running a joint genotyping pipeline like the one included in the Databricks Runtime for Genomics.