1

I have a data.frame that looks like this:

   A C  G T
1  6 0 14 0
2  0 0 20 0
3 14 0  6 0
4 14 0  6 0
5  6 0 14 0

(actually, I have 1800 of the with varying numbers of rows..)

Just to explain what you are looking at: Each row is one SNP, so it can either be one base (A,C,G,T) or another base (A,C,G,T) SNP1’s Major allele is “G”, which appears in 14 individuals, the minor allele is “A”, which appears in 6 out of the 20 individuals in the dataset. The 14 individuals that show G at SNP1 are the same the show A at SNP3, so there are two possibilities for the combination of bases along the 5 rows: one would be GGAAG and one would be AGGGA. These can (theoretically) be built from the colnames of all the cells containing either 6 or 14 in the corresponding row, resulting in something like this:

   A C  G T 14 6
1  6 0 14 0  G A
2  0 0 20 0  G G
3 14 0  6 0  A G
4 14 0  6 0  A G
5  6 0 14 0  G A

Is there an elegant way to achieve something like this? I have a piece of code from the answer to a somewhat related question that will return positions of a specific value within a matrix.

    mat <- matrix(c(1:3), nrow = 4, ncol = 4)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    1
[2,]    2    3    1    2
[3,]    3    1    2    3
[4,]    1    2    3    1

    find <- function(mat, value) {
      nr <- nrow(mat)
      val_match <- which(mat == value)
      out <- matrix(NA, nrow= length(val_match), ncol= 2)
      out[,2] <- floor(val_match / nr) + 1
      out[,1] <- val_match %% nr
      return(out)
    }

    find(mat, 2)

     [,1] [,2]
[1,]    2    1
[2,]    1    2
[3,]    0    3
[4,]    3    3
[5,]    2    4

I think I can figure out how to adjust this to where it returns the colname from the original data.frame, but it requires the value it is looking for as input. – There are potentially several of those in one data snippet (as seen in the example above, 14 and 6), and it is/they are different for each snippet of my data. In some of them, there are no duplicates at all. In addition, if one of the values hits 20, then the corresponding colname is automatically the one to choose (as seen in row 2 on the example above).

EDIT I have tried the code suggested by thelatemail, and it works fine on some of the data, but not on all of them.

This one, for example, produces results that I don't fully understand: subset looks like this:

    A C G T
 1  0 0 3 1
 2  0 9 0 3
 3  3 0 0 2
 4  0 3 0 2
 5  2 0 0 3
 6  0 2 0 3

    sel <- subset > 0
    ord <- order(row(subset)[sel], -subset[sel])
    haplo1 <- split(names(subset)[col(subset)[sel]][ord], row(subset)[sel][ord])

This produces

 1
 [1] "G" "T"
 2
 [1] "C" "T"
 3
 [1] "A" "T"
 4
 [1] "C" "T"
 5
 [1] "T" "A"
 6
 [1] "T" "C"

Since there is a 3 in every row, I don't understand why these are not all in one of these possibilities (which would result in GTACTT and TCTTAC instead).

I have also realized that I have a lot of missing alleles, were only one or two individuals were found to have a base in this locis. Can a column with "missing" be included somehow? - I tried to just tack it on, which gave me an error about non-corresponding row numbers.

Community
  • 1
  • 1
  • Wouldn't there be a host of other possibilities - e.g. "AGAAA", "GGAAA", "AGGAA" .... "AAAAA" or "GGGGG" ? – thelatemail Oct 12 '16 at 04:05
  • Theoretically, yes. However, a lot of SNPs, especially when close together (all of the ones within one snippet of data are within 64bp of each other), are in linkage, which means that it is possible to construct runs of alleles that appear only together (haplotypes). The beauty of this particular dataset is that it is from honeybee drones, who are haploid (as opposed to "normal" female diploid workers/queens), so there is only one of the two possible main alleles in each individual and what I see is actually what is inherited in linkage. – Gertje Petersen Oct 12 '16 at 22:59
  • Why would you want to pick the genotype based on 3's? You are calling each locus based on the number of genotypes at that locus. You only have 4 people genotyped on row 1, but you have 12 people genotyped on row 2. Including NA's won't change the results of picking the most frequent and least frequent allele. There are software packages in R on bioconductor that will do variant calling and imputation that you may want to look into. – akaDrHouse Oct 13 '16 at 13:19
  • @akaDrHouse, thanks for your input. Sorry I didn't reply before, I took some time to try and (once more) hunt down existing methodology to handle this problem. Since this is Genotyping-by-Sequencing data and I would like to align to the reference genome, my choices are somewhat limited, and none include the construction of these.. local haplotypes. So far I have not been able to find something that ticks all my boxes, which is why I am subjecting myself to trying to do this myself, albeit crudely. ;) – Gertje Petersen Oct 17 '16 at 02:44
  • @akaDrHouse again: I was trying to keep runs of the same number of counts for a base together because the genotyped individuals are haploid honeybee drones, who undergo purifying selection, meaning that some haplotypes aren't viable in drones, skewing Major and Minor alleles from the overall population. - That aside, I simply don't understand how all the 3s do not appear in one column code-wise, if it is putting all numbers in the rows in order. – Gertje Petersen Oct 17 '16 at 02:57
  • @GertjePetersen Because 3 is not always the minimum or always the maximum. – akaDrHouse Oct 17 '16 at 13:39
  • @akaDrHouse Ah, I had my thinking about what the order() function does tied in a knot. Thanks. – Gertje Petersen Oct 17 '16 at 21:01

2 Answers2

1

In order to get my minimum function to work, I had to covert zero's to NA. For some reason, na.rm=TRUE doesn't work with which.min

See if this is helpful for you:

A <- c(6,0,14,14,6)
C <- c(0,0,0,0,0)
G <- c(14,20,6,6,14)
T <- c(0,0,0,0,0)
mymatrix <- as.matrix(cbind(A,C,G,T))
mymatrix<-ifelse(mymatrix==0,mymatrix==NA,mymatrix)
mymatrix

major_allele <- colnames(mymatrix)[apply(mymatrix,1,which.max)] ; head(major_allele)
minor_allele <- colnames(mymatrix)[apply(mymatrix,1,which.min)] ; head(minor_allele)

myds<-as.data.frame(cbind(mymatrix,major_allele,minor_allele))
myds


> myds
     A    C  G    T major_allele minor_allele
1    6 <NA> 14 <NA>            G            A
2 <NA> <NA> 20 <NA>            G            G
3   14 <NA>  6 <NA>            A            G
4   14 <NA>  6 <NA>            A            G
5    6 <NA> 14 <NA>            G            A
akaDrHouse
  • 2,190
  • 2
  • 20
  • 29
1

Here's an attempt that will work for however many hits there are in each row. It returns a list object, which is probably appropriate for differing lengths of results per row.

sel <- dat > 0
ord <- order(row(dat)[sel], -dat[sel])
split(names(dat)[col(dat)[sel]][ord], row(dat)[sel][ord] )
#List of 5
# $ 1: chr [1:2] "G" "A"
# $ 2: chr "G"
# $ 3: chr [1:2] "A" "G"
# $ 4: chr [1:2] "A" "G"
# $ 5: chr [1:2] "G" "A"

Where dat was:

dat <- read.table(text="
   A C  G T
1  6 0 14 0
2  0 0 20 0
3 14 0  6 0
4 14 0  6 0
5  6 0 14 0
", header=TRUE)
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Thank you very much. It took me a while to figure out what exactly that split does, but unfortunately, I tried it on another snippet of data and it doesn't return the same results and I can't really figure out why. A C G T 1 0 0 3 1 2 0 9 0 3 3 3 0 0 2 4 0 3 0 2 5 2 0 0 3 6 0 2 0 3 – Gertje Petersen Oct 13 '16 at 02:01
  • Argh. Sorry for that. I've added the new data example and my resulting problem to the original question. – Gertje Petersen Oct 13 '16 at 02:02
  • @GertjePetersen - assuming your data in the comments splits into rows starting 1... 6, I can get my code to work as described above. – thelatemail Oct 13 '16 at 02:33
  • It does, I have added it above. – Gertje Petersen Oct 13 '16 at 02:38
  • @GertjePetersen - all my code does is take the >0 cells in each row and puts them in order from greatest to least. I'm not sure how the output is supposed to be to get your 2 sets. I don't know anything about biochemistry, so I may be overlooking how this ordering is meant to happen. – thelatemail Oct 13 '16 at 02:46
  • I don't think it has anything to do with biochemistry, more with stubbornness on my side, and having gone through it again, I think I got stuck in my own head and there is no problem. Now, I want to run your code over a list of lots of little dataframes in one big list. I'm trying to use lapply and adjusting the code accordingly (like this: sel <- lapply(subset,function(x) {x[x>0]}) - but I can only get this first line to work, after that it throws up errors because I don't have the right object-type as input. - Can you help me with this as well? – Gertje Petersen Oct 17 '16 at 03:56