I have a dataset (a data frame) of genotypes at different marker loci, scored in 0/1 notation:
structure(list(Genotype = 1:5, Locus_1a = c(0L, 1L, 1L, 1L, 1L ), Locus_1b = c(1L, 1L, 0L, 1L, 0L), Locus_1c = c(1L, 0L, 1L, 1L, 0L), Locus_2a = c(0L, 1L, 1L, 0L, 0L), Locus_2b = c(1L, 1L, 0L, 1L, 1L), Locus_2c = c(1L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))
I am trying to convert the dataset from 0/1 notation to the following:
Genotype | Locus_1_1 | Locus_1_2 | Locus_1_3 |
---|---|---|---|
1 | b | c | |
2 | a | b | |
3 | a | c | |
4 | a | b | c |
5 | a | a | |
Most genotypes (1 through 3 in this example) are diploid (2n) and have two distinct alleles. Each allele is identified by the letter at the end of the column name.
Individual 4 is a triploid (3n) individual and has three distinct alleles.
Individual 5 is diploid (2n) but homozygous for a single allele (Locus_1a), so that allele should appear twice in the output.
In the input data, each column name is the marker locus name followed by a letter identifying the allele (a, b, c, etc.), and a 1 in that column means the allele was detected in that individual.
I'm not sure how to write the code for this conversion.
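
For reference, here is a minimal base-R sketch of the conversion described above; it is one possible approach, not a definitive one. It assumes that every column after `Genotype` is named as a locus prefix followed by lowercase allele letters, that there are no missing values, and that a lone allele should always be doubled (homozygous diploid). The helper name `recode_locus` and the object names `loci` and `result` are made up for illustration.

```r
# Example data from the question
df <- data.frame(
  Genotype = 1:5,
  Locus_1a = c(0L, 1L, 1L, 1L, 1L),
  Locus_1b = c(1L, 1L, 0L, 1L, 0L),
  Locus_1c = c(1L, 0L, 1L, 1L, 0L),
  Locus_2a = c(0L, 1L, 1L, 0L, 0L),
  Locus_2b = c(1L, 1L, 0L, 1L, 1L),
  Locus_2c = c(1L, 0L, 1L, 1L, 0L)
)

# Hypothetical helper: turn the 0/1 columns of one locus into allele letters
recode_locus <- function(df, locus) {
  # Columns belonging to this locus, e.g. Locus_1a, Locus_1b, Locus_1c
  cols <- grep(paste0("^", locus, "[a-z]+$"), names(df), value = TRUE)
  # Allele labels are whatever follows the locus prefix: "a", "b", "c"
  alleles <- sub(paste0("^", locus), "", cols)
  # TRUE where the allele was detected
  m <- as.matrix(df[cols]) == 1

  calls <- lapply(seq_len(nrow(m)), function(i) {
    present <- alleles[m[i, ]]
    # Assumption: a single detected allele means a homozygous diploid
    if (length(present) == 1L) present <- rep(present, 2L)
    present
  })

  # Pad shorter genotypes with "" so every row has the same width
  width <- max(lengths(calls), 2L)
  padded <- lapply(calls, function(x) c(x, rep("", width - length(x))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0(locus, "_", seq_len(width))
  out
}

# Locus prefixes are the column names with the trailing allele letters removed
loci <- unique(sub("[a-z]+$", "", names(df)[-1]))

result <- cbind(df["Genotype"],
                do.call(cbind, lapply(loci, recode_locus, df = df)))
print(result)
```

Each locus gets as many output columns as its most polyploid individual needs, so in this example `Locus_1` produces three columns and `Locus_2` produces two. Empty cells are filled with `""`; `NA` could be used instead if that suits downstream tools better.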