I have a data.frame that looks like this:
A C G T
1 6 0 14 0
2 0 0 20 0
3 14 0 6 0
4 14 0 6 0
5 6 0 14 0
(Actually, I have 1800 of these, with varying numbers of rows.)
Just to explain what you are looking at: each row is one SNP, so it can be either one base (A, C, G, T) or another base (A, C, G, T). SNP1's major allele is “G”, which appears in 14 individuals; the minor allele is “A”, which appears in 6 of the 20 individuals in the dataset. The 14 individuals that show G at SNP1 are the same ones that show A at SNP3, so there are two possible combinations of bases along the 5 rows: one would be GGAAG and the other AGGGA. These can (theoretically) be built from the colnames of all the cells containing either 6 or 14 in the corresponding row, resulting in something like this:
A C G T 14 6
1 6 0 14 0 G A
2 0 0 20 0 G G
3 14 0 6 0 A G
4 14 0 6 0 A G
5 6 0 14 0 G A
Is there an elegant way to achieve something like this? I have a piece of code from the answer to a somewhat related question that will return positions of a specific value within a matrix.
mat <- matrix(c(1:3), nrow = 4, ncol = 4)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 1
[2,] 2 3 1 2
[3,] 3 1 2 3
[4,] 1 2 3 1
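find <- function(mat, value) {
  nr <- nrow(mat)
  # Linear (column-major) positions of the matching cells
  val_match <- which(mat == value)
  out <- matrix(NA, nrow = length(val_match), ncol = 2)
  # Shift to 0-based before dividing so the last row maps to nr, not 0
  out[, 2] <- (val_match - 1) %/% nr + 1  # column index
  out[, 1] <- (val_match - 1) %% nr + 1   # row index
  return(out)
}
find(mat, 2)
[,1] [,2]
[1,] 2 1
[2,] 1 2
[3,] 4 2
[4,] 3 3
[5,] 2 4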
I think I can figure out how to adjust this so that it returns the colname from the original data.frame, but it requires the value it is looking for as input. There are potentially several such values in one data snippet (14 and 6 in the example above), and they are different for each snippet of my data. In some snippets, there are no duplicates at all. In addition, if one of the values hits 20, then the corresponding colname is automatically the one to choose (as seen in row 2 of the example above).
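To make the intended result concrete, here is a minimal sketch of the kind of output I am after for the first example. It is not a general solution: `major_minor` is a hypothetical helper name, and it picks bases by rank within each row (highest count, then second-highest non-zero count), which only matches the "14" and "6" columns when the major count is the same in every row.

```r
# Example data from the top of the post
snps <- data.frame(A = c(6, 0, 14, 14, 6),
                   C = c(0, 0, 0, 0, 0),
                   G = c(14, 20, 6, 6, 14),
                   T = c(0, 0, 0, 0, 0))

# Hypothetical helper: for each row, return the base with the highest count
# (major) and the base with the second-highest non-zero count (minor).
# If only one base is present (count 20 here), it is used for both.
major_minor <- function(df) {
  bases <- names(df)
  t(apply(as.matrix(df), 1, function(counts) {
    ord <- order(counts, decreasing = TRUE)
    major <- bases[ord[1]]
    minor <- if (counts[ord[2]] > 0) bases[ord[2]] else major
    c(major = major, minor = minor)
  }))
}

res <- major_minor(snps)
hap_major <- paste(res[, "major"], collapse = "")  # "GGAAG"
hap_minor <- paste(res[, "minor"], collapse = "")  # "AGGGA"
```

Ties between two non-zero counts (for example 10 and 10) are resolved by column order here, which is an arbitrary choice.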
EDIT: I have tried the code suggested by thelatemail, and it works fine on some of the data, but not on all of it.
This one, for example, produces results that I don't fully understand. subset looks like this:
A C G T
1 0 0 3 1
2 0 9 0 3
3 3 0 0 2
4 0 3 0 2
5 2 0 0 3
6 0 2 0 3
sel <- subset > 0
ord <- order(row(subset)[sel], -subset[sel])
haplo1 <- split(names(subset)[col(subset)[sel]][ord], row(subset)[sel][ord])
This produces
1
[1] "G" "T"
2
[1] "C" "T"
3
[1] "A" "T"
4
[1] "C" "T"
5
[1] "T" "A"
6
[1] "T" "C"
Since there is a 3 in every row, I don't understand why the bases with a count of 3 do not all end up in the same position of the output (which would result in GTACTT and TCTTAC instead).
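As far as I can tell, the `order()` call sorts by descending count within each row, so the first element is the most frequent base in that row rather than the base with count 3. The sketch below contrasts the two readings on this snippet. It assumes the shared count (3) is known in advance and occurs exactly once per row, which will not hold for all of my data; `subset_df` is just a copy of the table above.

```r
# The snippet shown above
subset_df <- data.frame(A = c(0, 0, 3, 0, 2, 0),
                        C = c(0, 9, 0, 3, 0, 2),
                        G = c(3, 0, 0, 0, 0, 0),
                        T = c(1, 3, 2, 2, 3, 3))
m <- as.matrix(subset_df)

# Reading 1: most frequent base per row (what ordering by count gives)
hap_rank1 <- paste(colnames(m)[max.col(m, ties.method = "first")],
                   collapse = "")

# Reading 2: base whose count equals the shared value 3 in each row.
# Assumes exactly one such cell per row; otherwise apply() returns a list.
hap_three <- paste(colnames(m)[apply(m == 3, 1, which)], collapse = "")

# The other non-zero base in each row (again assuming exactly one per row)
hap_other <- paste(colnames(m)[apply(m > 0 & m != 3, 1, which)],
                   collapse = "")
```

The two readings differ only in row 2, where C has a count of 9 and T has a count of 3.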
I have also realized that I have a lot of missing alleles, where only one or two individuals were found to have a base at a given locus. Can a column for "missing" be included somehow? I tried to just tack it on, which gave me an error about non-corresponding row numbers.