abstract
The current representation of genetic data as [0,1,2] poses some key limitations on interpretation and analysis. The proposed solution treats the genetic data as a categorical variable belonging to the following categories; [A,T,C,G,m]. The categories represent at least one nucleotide for a given SNP, and the ‘m’ category represents whether the SNP is homozygous for a given individual. The proposed format will fix some of the limitations inherent to the current representation of genetic data.
limitations of the current representation of genetic data
The first problem with the [0,1,2] format is the inability to accurately capture in/del elements. This is because of the ambiguity of what ‘0’ means. As it could mean no SNP is detected (deletion) or it is a homozygous SNP. Replacing instances where no SNP is detected with NA, NaN, or NULL can result in a broken analysis.
The second issue is the inability to capture triallelic variants successfully. Because [0,1,2] encoding only represents 3 options, homozygous [0,2] and heterozygous [1]. For rare outcomes where triallelic variants might be associated with the outcome, the third variant is effectively censured.
binary encoding with examples
The proposed solution is to have the data more closely resemble the biological data. To start, let’s consider representing nucleic acids as categorical variables [A,T,C,G] and a given SNP for seven individuals [AT, AT, AA, TT, TT, TT, AA]. Now, for encoding, at least one copy of a nucleotide will be represented as 1, and no copies as 0. The Given SNP is then represented as
A | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
T | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
If we want to reconstruct the nucleic acid representation, another assumption is made. If A is 1 and T is 0 then the SNP for the individual is homozygous A [AA]. Another row is added to expressly represent if a SNP is homozygous, represented as ‘m’ in this example.
A | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
T | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
m | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
Adding information from two more individuals, the SNP is discovered to be triallelic. [AT, AT, AA, TT, TT, TT, AA, TC, TC].
A | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
T | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
m | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
If instead we have partial deletion, this can also be expressed. [AT, AT, AA, TT, TT, TT, AA, T_, T_]
A | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
T | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
m | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
Limitations
There are some limitations to this approach. This includes increasing the dimensions of the data (2D to a 3D matrix or 1D array to a 2D matrix), which makes calculating genetic PCs more complex. As well as changing how GWAS is performed and interpreted. One of the more complex problems is converting raw sequencing data into a binary encoded format. These limitations can be addressed, and I’ll discuss them further later.
However, the limitations of the current [0,1,2] format must also be considered. It only has 3 possible states, 4 if including NA, compared to the binary formats possible 10. Also, the format does not represent insertions or deletions unambiguously. These issues arise from the use of Mendelian genetics, which defines only two possible variants for a given gene.
Leave a Reply
You must be logged in to post a comment.