Split Multiallelic Variants

Splitting multiallelic variants into biallelic variants is sometimes required before downstream analysis. Glow provides the split_multiallelics transformer, which is applied to a variant DataFrame and splits each multiallelic variant in it into biallelic variants. The transformer handles any number of ALT alleles and any ploidy.

Note

The splitting logic used by the split_multiallelics transformer is the same as that of the vt decompose tool of the vt package when run with the -s option (note that the example on the vt decompose user manual page does not fully reflect the behavior of vt decompose -s).

The precise behavior of the split_multiallelics transformer is presented below:

  • A multiallelic row with \(n\) ALT alleles is split into \(n\) biallelic rows, each carrying one of the ALT alleles of the original row. The REF allele in every split row is the same as the REF allele in the multiallelic row.

  • Each INFO field is split across the split rows if it has as many elements as there are ALT alleles; otherwise it is repeated in every split row. The boolean INFO field splitFromMultiAllelic is added or updated to indicate whether the row results from splitting a multiallelic row through this transformation. A new INFO field called OLD_MULTIALLELIC is added to the DataFrame; for each split row, it holds the CHROM:POS:REF/ALT of the original multiallelic row. Note that an INFO field must be flattened (as explained here) in order to be split by this transformer. Unflattened INFO fields (such as those inside an attributes field) are not split, but repeated in whole across all split rows.

  • Genotype fields for each sample are treated as follows. The GT field becomes biallelic in each row, with the original ALT alleles that are not present in that row replaced by no-calls. Fields whose number of entries equals the number of REF + ALT alleles are split so that each split row keeps only the entries for the REF allele and the ALT allele present in that row. Fields that follow colex (colexicographic) order, e.g., GL, PL, and GP, are split so that each row lists only the elements for genotypes composed of the REF and ALT alleles in that row. Other genotype fields are repeated across the split rows.

  • Any other field in the DataFrame is just repeated across the split rows.
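
The INFO, GT, and REF + ALT rules above can be illustrated with a small pure-Python sketch. This is not Glow's implementation: the function names are hypothetical, ALT alleles are numbered from 1, and a no-call is represented as -1 purely for illustration.

```python
def split_info_value(value, n_alt):
    """Return the INFO value each of the n_alt split rows receives.

    A list with exactly n_alt elements is distributed one element per row;
    any other value is repeated unchanged in every row.
    """
    if isinstance(value, (list, tuple)) and len(value) == n_alt:
        return list(value)
    return [value] * n_alt


def split_gt(calls, alt_index):
    """Biallelic GT for the split row that carries ALT number alt_index.

    REF (0) stays 0, the kept ALT becomes 1, and every other allele
    (another ALT, or an existing no-call) becomes a no-call (-1 here).
    """
    biallelic = []
    for call in calls:
        if call == 0:
            biallelic.append(0)
        elif call == alt_index:
            biallelic.append(1)
        else:
            biallelic.append(-1)
    return biallelic


def split_ref_alt_field(values, alt_index):
    """Keep only the REF entry and the entry of the ALT kept in this row."""
    return [values[0], values[alt_index]]


# Values taken from a row with two ALT alleles.
print(split_info_value([3, 2], n_alt=2))   # AC: one element per row
print(split_info_value(8, n_alt=2))        # AN: repeated in both rows
print(split_gt([0, 1], alt_index=2))       # GT 0/1 seen from the second ALT
print(split_ref_alt_field([2, 15, 31], alt_index=2))  # AD for the second ALT
```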

As an example (shown in VCF file format), the following multiallelic row

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1
20      101     .       A       ACCA,TCGG       .       PASS    VC=INDEL;AC=3,2;AF=0.375,0.25;AN=8      GT:AD:DP:GQ:PL  0/1:2,15,31:30:99:2407,0,533,697,822,574

will be split into the following two biallelic rows:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1
20      101     .       A       ACCA    .       PASS    VC=INDEL;AC=3;AF=0.375;AN=8;OLD_MULTIALLELIC=20:101:A/ACCA/TCGG GT:AD:DP:GQ:PL  0/1:2,15:30:99:2407,0,533
20      101     .       A       TCGG    .       PASS    VC=INDEL;AC=2;AF=0.25;AN=8;OLD_MULTIALLELIC=20:101:A/ACCA/TCGG  GT:AD:DP:GQ:PL  0/.:2,31:30:99:2407,697,574
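
The PL values in the two split rows can be reproduced with a short sketch of colex-ordered indexing. It assumes the standard VCF genotype ordering, in which a genotype with sorted allele indices \(a_1 \le \dots \le a_p\) sits at position \(\sum_{k=1}^{p} \binom{a_k + k - 1}{k}\). The helper names are hypothetical and the code is an illustration, not Glow's implementation.

```python
from math import comb


def genotype_index(alleles):
    """Position of a genotype in a colex-ordered field such as PL.

    With k counted from 0 over the sorted alleles, each allele a
    contributes comb(a + k, k + 1) to the position.
    """
    return sum(comb(a, 1) if k == 0 else comb(a + k, k + 1)
               for k, a in enumerate(sorted(alleles)))


def split_colex_field(values, alt_index, ploidy=2):
    """Keep only the entries for genotypes made of REF and one ALT allele."""
    kept = []
    for n_alt_copies in range(ploidy + 1):
        genotype = [0] * (ploidy - n_alt_copies) + [alt_index] * n_alt_copies
        kept.append(values[genotype_index(genotype)])
    return kept


pl = [2407, 0, 533, 697, 822, 574]  # PL of SAMPLE1 in the multiallelic row
print(split_colex_field(pl, alt_index=1))  # [2407, 0, 533]
print(split_colex_field(pl, alt_index=2))  # [2407, 697, 574]
```

For a diploid sample the positions kept for the second ALT allele are 0, 3, and 5, which correspond to genotypes 0/0, 0/2, and 2/2.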

Usage

Assuming df_original is a DataFrame that contains the genomic variant records, the transformer is applied as follows (the first line is Python, the second Scala):

Python:

df_split = glow.transform("split_multiallelics", df_original)

Scala:

val df_split = Glow.transform("split_multiallelics", df_original)

Tip

The split_multiallelics transformer is often significantly faster when Spark SQL's whole-stage code generation feature is turned off. It is therefore recommended that you temporarily turn this feature off with the following command before using the transformer.

Python:

spark.conf.set("spark.sql.codegen.wholeStage", False)

Scala:

spark.conf.set("spark.sql.codegen.wholeStage", false)

Remember to turn this feature back on after your split DataFrame is materialized.
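
One way to make sure the setting is restored is to wrap the change in a context manager. The sketch below is a suggestion rather than part of Glow: the helper name whole_stage_codegen_disabled is hypothetical, and it assumes only that the session object exposes conf.get(key, default) and conf.set(key, value), as a PySpark SparkSession does.

```python
from contextlib import contextmanager

CODEGEN_KEY = "spark.sql.codegen.wholeStage"


@contextmanager
def whole_stage_codegen_disabled(spark):
    """Turn whole-stage codegen off inside the block, then restore it."""
    # Remember the current value; "true" is used if the key is unset.
    previous = spark.conf.get(CODEGEN_KEY, "true")
    spark.conf.set(CODEGEN_KEY, "false")
    try:
        yield
    finally:
        # Restore the earlier value even if the block raised an error.
        spark.conf.set(CODEGEN_KEY, previous)


# Intended use (requires a live Spark session with Glow; output_path is a
# hypothetical location). The split DataFrame is materialized inside the
# block, so the work runs while codegen is still off:
#
# with whole_stage_codegen_disabled(spark):
#     df_split = glow.transform("split_multiallelics", df_original)
#     df_split.write.parquet(output_path)
```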
