Binary encoding for human genetic data

abstract

The current representation of genetic data as [0,1,2] places some key limitations on interpretation and analysis. The proposed solution treats genetic data as a categorical variable with the following categories: [A,T,C,G,m]. Each nucleotide category records whether an individual carries at least one copy of that nucleotide at a given SNP, and the ‘m’ category records whether the individual is homozygous at that SNP. The proposed format addresses some of the limitations inherent to the current representation of genetic data.

limitations of the current representation of genetic data

The first problem with the [0,1,2] format is that it cannot accurately capture in/del elements, because the meaning of ‘0’ is ambiguous: it could mean that no SNP was detected (a deletion) or that the individual is homozygous at that SNP. Replacing instances where no SNP is detected with NA, NaN, or NULL can result in a broken analysis.

The second issue is that triallelic variants cannot be captured, because [0,1,2] encoding represents only three options: homozygous [0,2] and heterozygous [1]. For rare outcomes where a triallelic variant might be associated with the outcome, the third variant is effectively censored.
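The censoring of a third allele can be illustrated with a minimal sketch. It assumes the [0,1,2] value is a count of copies of one designated alternate allele; the function name additive_code and the example SNP (reference A, alternate T) are hypothetical and only for illustration.

```python
def additive_code(genotype, alt):
    """Count copies of the designated alternate allele (0, 1, or 2)."""
    return sum(1 for allele in genotype if allele == alt)

# Hypothetical SNP: reference allele A, alternate allele T, rare third allele C.
genotypes = ["AA", "AT", "TT", "TC"]
codes = [additive_code(g, alt="T") for g in genotypes]
print(codes)  # [0, 1, 2, 1]
```

Under this assumption AT and TC both receive the code 1, so the C allele is no longer visible in the encoded data.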

binary encoding with examples

The proposed solution is to make the encoded data more closely resemble the underlying biology. To start, let’s consider representing nucleic acids as categorical variables [A,T,C,G] and a given SNP for seven individuals [AT, AT, AA, TT, TT, TT, AA]. For encoding, at least one copy of a nucleotide is represented as 1, and no copies as 0. The given SNP is then represented as:

A1110001
T1101110
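The two rows above can be reproduced with a short sketch. The function name presence_rows and the use of bit strings are illustrative choices, not part of the proposal itself.

```python
def presence_rows(genotypes, nucleotides):
    """Map each nucleotide to a bit string: 1 if the individual has >= 1 copy."""
    return {
        n: "".join("1" if n in g else "0" for g in genotypes)
        for n in nucleotides
    }

snp = ["AT", "AT", "AA", "TT", "TT", "TT", "AA"]
rows = presence_rows(snp, "AT")
for name, bits in rows.items():
    print(name + bits)
# A1110001
# T1101110
```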

Reconstructing the nucleic acid representation from these two rows requires an additional assumption: if A is 1 and T is 0, then the individual is homozygous for A [AA]. To state this explicitly rather than assume it, another row is added that records whether the SNP is homozygous, labelled ‘m’ in this example.

A1110001
T1101110
m0011111
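As a rough check that the three rows carry enough information to recover the original genotypes, here is a minimal decoding sketch. The function name decode and the dictionary layout are assumptions made for this example.

```python
def decode(rows):
    """Rebuild genotype strings from nucleotide presence rows plus the 'm' row."""
    nucleotides = [key for key in rows if key != "m"]
    genotypes = []
    for i in range(len(rows["m"])):
        present = [n for n in nucleotides if rows[n][i] == "1"]
        if rows["m"][i] == "1":
            # Homozygous: the single nucleotide present is carried twice.
            genotypes.append(present[0] * 2)
        else:
            # Not homozygous: list each nucleotide that is present.
            genotypes.append("".join(present))
    return genotypes

rows = {"A": "1110001", "T": "1101110", "m": "0011111"}
print(decode(rows))  # ['AT', 'AT', 'AA', 'TT', 'TT', 'TT', 'AA']
```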

After adding data from two more individuals, the SNP turns out to be triallelic: [AT, AT, AA, TT, TT, TT, AA, TC, TC].

A111000100
T110111011
C000000011
G000000000
m001111100

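If the two new individuals instead carry a partial deletion, this can also be expressed: [AT, AT, AA, TT, TT, TT, AA, T_, T_]. For these individuals the T row is 1 while the ‘m’ row is 0, which distinguishes T_ from homozygous TT.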

A111000100
T110111011
C000000000
G000000000
m001111100
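Both nine-individual examples can be generated by one encoder. The sketch below is one possible implementation, not a definitive one; the name encode is hypothetical, and it assumes genotypes are two-character strings in which ‘_’ marks a missing allele, following the T_ notation used here.

```python
NUCLEOTIDES = "ATCG"

def encode(genotypes):
    """Encode two-character genotypes into A, T, C, G presence rows and an 'm' row."""
    rows = {n: "" for n in NUCLEOTIDES}
    rows["m"] = ""
    for g in genotypes:
        for n in NUCLEOTIDES:
            rows[n] += "1" if n in g else "0"
        # Homozygous only when both alleles are the same real nucleotide,
        # so a deletion such as T_ (or __) is never flagged as homozygous.
        rows["m"] += "1" if g[0] == g[1] and g[0] in NUCLEOTIDES else "0"
    return rows

triallelic = ["AT", "AT", "AA", "TT", "TT", "TT", "AA", "TC", "TC"]
deletion = ["AT", "AT", "AA", "TT", "TT", "TT", "AA", "T_", "T_"]

for label, snp in [("triallelic", triallelic), ("partial deletion", deletion)]:
    print(label)
    for name, bits in encode(snp).items():
        print(name + bits)
```

The only difference between the two outputs is the C row, which is 000000011 for the triallelic SNP and all zeros for the partial deletion.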

Limitations

There are some limitations to this approach. It increases the dimensions of the data (a 2D matrix becomes a 3D matrix, or a 1D array becomes a 2D matrix), which makes calculating genetic PCs more complex, and it changes how GWAS is performed and interpreted. One of the more complex problems is converting raw sequencing data into a binary encoded format. These limitations can be addressed, and I’ll discuss them further later.

However, the limitations of the current [0,1,2] format must also be considered. It has only 3 possible states, or 4 if NA is included, compared to the binary format’s possible 10 genotypes (four homozygous and six heterozygous). The format also does not represent insertions or deletions unambiguously. These issues arise from the use of Mendelian genetics, which defines only two possible variants for a given gene.
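The figure of 10 can be checked by enumerating the unordered pairs of the four nucleotides. This is a small sketch that assumes the count refers to complete diploid genotypes, without deletions.

```python
from itertools import combinations_with_replacement

# Unordered pairs of nucleotides, e.g. AT and TA count as the same genotype.
genotypes = ["".join(pair) for pair in combinations_with_replacement("ATCG", 2)]
homozygous = [g for g in genotypes if g[0] == g[1]]
heterozygous = [g for g in genotypes if g[0] != g[1]]

print(genotypes)
print(len(genotypes), len(homozygous), len(heterozygous))  # 10 4 6
```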

