Read Genome Annotations (GFF3) as a Spark DataFrame
GFF3 (Generic Feature Format Version 3) is a 9-column tab-separated text file format commonly used to store genomic annotations. Typically, the majority of annotation data in this format appears in the ninth column, called attributes, as a semicolon-separated list of <tag>=<value> entries. If Spark's standard csv data source is used to read GFF3 files, the whole list of attribute tag-value pairs is read as a single string-typed column, making queries on these tags and values cumbersome.
To address this issue, Glow provides the gff data source. In addition to loading the first 8 columns of GFF3 as properly typed columns, the gff data source parses all attribute tag-value pairs in the ninth column of GFF3 and creates an appropriately typed column for each tag. In each row, the column corresponding to a tag contains the tag's value in that row (or null if the tag does not appear in the row).
Like any Spark data source, reading GFF3 files with the gff data source can be done in a single line of code:
df = spark.read.format("gff").load(path)
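The attribute-derived columns then behave like ordinary DataFrame columns. As a minimal sketch, assuming the annotation file uses gene as a feature type and defines the official ID and Name tags (adjust the column names to whatever tags your file actually contains):

genes = (
    df.filter(df["type"] == "gene")                   # keep only gene features
      .select("seqId", "start", "end", "ID", "Name")  # ID and Name are parsed from the attributes column
)
genes.show(5, truncate=False)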
The gff data source supports all compression formats supported by Spark's csv data source, including .gz and .bgz files. It also supports reading globs of files in one command.
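For example, a compressed file and a glob of files can each be loaded with a single call (the paths below are hypothetical):

df_bgz = spark.read.format("gff").load("/data/annotations/homo_sapiens.gff3.bgz")
df_glob = spark.read.format("gff").load("/data/annotations/*.gff3.gz")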
Note
The gff data source ignores any comment and directive lines (lines starting with #) in the GFF3 file, as well as any FASTA lines that may appear at the end of the file.
Schema
1. Inferred schema
If no user-specified schema is provided (as in the example above), the data source infers the schema as follows:
The first 8 fields of the schema (“base” fields) correspond to the first 8 columns of the GFF3 file. Their names, types and order will be as shown below:
|-- seqId: string (nullable = true)
|-- source: string (nullable = true)
|-- type: string (nullable = true)
|-- start: long (nullable = true)
|-- end: long (nullable = true)
|-- score: double (nullable = true)
|-- strand: string (nullable = true)
|-- phase: integer (nullable = true)
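A quick way to verify the inferred schema, including the attribute-derived fields that follow these base fields, is to print it on the loaded DataFrame:

df.printSchema()  # base fields first, then attribute-derived fields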
Note
Although the start column in the GFF3 file is 1-based, the start field in the DataFrame will be 0-based to match the general practice in Glow.

The next fields in the inferred schema are created as the result of parsing the attributes column of the GFF3 file. Each tag will have its own field in the schema. Fields corresponding to any "official" tag (those referred to as tags with pre-defined meaning) come first, followed by fields corresponding to any other tag ("unofficial" tags) in alphabetical order. The complete list of official fields, their data types, and order are as shown below:
|-- ID: string (nullable = true)
|-- Name: string (nullable = true)
|-- Alias: string (nullable = true)
|-- Parent: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- Target: string (nullable = true)
|-- Gap: string (nullable = true)
|-- DerivesFrom: string (nullable = true)
|-- Note: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- Dbxref: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- OntologyTerm: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- Is_circular: boolean (nullable = true)
The unofficial fields will be of string type.

Note
If any of the official tags does not appear in any row of the GFF3 file, the corresponding field will be excluded from the inferred schema.
The official/unofficial field name will be exactly as the corresponding tag appears in the GFF3 file (in terms of letter case).
The parser is insensitive to the letter case of the tag; e.g., if the attributes column in the GFF3 file contains both note and Note tags, they will both be mapped to the same column in the DataFrame. The name of the column in this case will be either note or Note, chosen randomly.
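Putting the inferred schema to use, the 0-based start can be converted back to GFF3's 1-based coordinate on the fly, and array-typed official fields such as Dbxref can be flattened to one entry per row. This is a sketch that assumes the ID and Dbxref tags are present in the file:

from pyspark.sql import functions as F

annotated = df.select(
    "seqId",
    (F.col("start") + 1).alias("start_1based"),  # convert back to GFF3's 1-based coordinate
    "end",
    "ID",
    F.explode_outer("Dbxref").alias("dbxref"),   # one database cross-reference per row
)
annotated.show(5, truncate=False)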
2. User-specified schema
As with any Spark data source, the gff data source also accepts a user-specified schema through the .schema command. The user-specified schema can have any subset of the base, official, and unofficial fields. The data source reads only the specified base fields and parses out only the specified official and unofficial fields from the attributes column of the GFF3 file. Here is an example of how the user can specify some base, official, and unofficial fields while reading the GFF3 file:
from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

mySchema = StructType(
    [StructField('seqId', StringType()),              # Base field
     StructField('start', LongType()),                # Base field
     StructField('end', LongType()),                  # Base field
     StructField('ID', StringType()),                 # Official field
     StructField('Dbxref', ArrayType(StringType())),  # Official field
     StructField('mol_type', StringType())]           # Unofficial field
)

df_user_specified = spark.read.format("gff").schema(mySchema).load(path)
Note
The base field names in the user-specified schema must match the names in the list above in a case-sensitive manner.
The official and unofficial fields will be matched with their corresponding tags in the GFF3 file in a case-and-underscore-insensitive manner. For example, if the GFF3 file contains the official tag db_xref, a user-specified schema field with the name dbxref, Db_Xref, or any other case-and-underscore-insensitive match will correspond to that tag.
The user can also include the original attributes column of the GFF3 file as a string field by including StructField('attributes', StringType()) in the schema.
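As a sketch of that last point, a user-specified schema that keeps the raw attributes string alongside a few parsed fields (reusing the imports from the example above; the field selection here is arbitrary) could look like this:

schema_with_raw = StructType(
    [StructField('seqId', StringType()),        # Base field
     StructField('start', LongType()),          # Base field
     StructField('end', LongType()),            # Base field
     StructField('ID', StringType()),           # Official field
     StructField('attributes', StringType())]   # Raw, unparsed attributes string
)

df_with_raw = spark.read.format("gff").schema(schema_with_raw).load(path)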