1

I am sure that I am just making a simple mistake.

I have a large matrix 3307592x9 that I need to iterate through and if column 8 (char/string) == 9 (char/string) (case insensitive) then column 3-7 (numeric 0-1) need to be 1-self. The code that I have written is:

for (i in 1:3307592){
    if(grepl(chr2SnpFreqNorm[i,8], chr2SnpFreqNorm[i,9], ignore.case=TRUE)){
        chr2SnpFreqNorm[i,3] <- 1 - chr2SnpFreqNorm[i,3]
        chr2SnpFreqNorm[i,4] <- 1 - chr2SnpFreqNorm[i,4]
        chr2SnpFreqNorm[i,5] <- 1 - chr2SnpFreqNorm[i,5]
        chr2SnpFreqNorm[i,6] <- 1 - chr2SnpFreqNorm[i,6]
        chr2SnpFreqNorm[i,7] <- 1 - chr2SnpFreqNorm[i,7]
    }
}

When I attempt to execute my R client just hangs for over half an hour before I cancel the command. I'm not sure what I have done wrong as the code looks correct to me.

/edit Example data

> chr2SnpFreqNorm[1:10,]
        ID   pos ceuChr2SnpFreq chsChr2SnpFreq lwkChr2SnpFreq
1  rs187078949 10133    0.070588235          0.000    0.030927835
2  rs191522553 10140    0.005882353          0.000    0.005154639
3  rs149483862 10286    0.100000000          0.135    0.226804124
4  rs150919307 10297    0.147058824          0.070    0.113402062
5  rs186644623 10315    0.000000000          0.000    0.000000000
6  rs193294418 10345    0.017647059          0.000    0.036082474
7  rs185496709 10386    0.082352941          0.020    0.087628866
8  rs188771313 10419    0.229411765          0.085    0.056701031
9  rs192945962 10425    0.100000000          0.020    0.015463918
10 rs184397180 10431    0.064705882          0.005    0.036082474
   tsiChr2SnpFreq yriChr2SnpFreq ALT AA
1     0.035714286    0.045454545   A  a
2     0.005102041    0.005681818   A  C
3     0.239795918    0.170454545   A  t
4     0.168367347    0.130681818   T  t
5     0.000000000    0.005681818   G  C
6     0.030612245    0.028409091   A  G
7     0.035714286    0.113636364   T  t
8     0.147959184    0.090909091   G  G
9     0.091836735    0.034090909   G  c
10    0.015306122    0.045454545   T  a

>
Thaddeus Aid
  • 169
  • 1
  • 1
  • 17
  • Your main mistake is using a `for` loop instead of a vectorized operation. (I would recommend package data.table for your data size.) It's also not clear to me, why you use `grepl`. A combination of `tolower` and `==` should be sufficient. If you [gave example data](http://stackoverflow.com/a/5963610/1412059) it would be easier to show you how to do this. – Roland Aug 31 '13 at 17:49
  • something likefixAncestor <- function(x){ if(tolower(x[8]) == tolower(x[9])){ x[3] <- 1 - x[3] x[4] <- 1 - x[4] x[5] <- 1 - x[5] x[6] <- 1 - x[6] x[7] <- 1 - x[7] } } – Thaddeus Aid Aug 31 '13 at 18:00
  • Your sample data is useless, because the condition is never fulfilled in it. – Roland Aug 31 '13 at 18:51
  • fixed the sample data – Thaddeus Aid Aug 31 '13 at 19:07
  • I've modified my answer to use it. – Roland Aug 31 '13 at 19:15

3 Answers3

2

First your data:

DF <- structure(list(ID = c("rs187078949", "rs191522553", "rs149483862", 
"rs150919307", "rs186644623", "rs193294418", "rs185496709", "rs188771313", 
"rs192945962", "rs184397180"), pos = c(10133L, 10140L, 10286L, 
10297L, 10315L, 10345L, 10386L, 10419L, 10425L, 10431L), ceuChr2SnpFreq = c(0.070588235, 
0.005882353, 0.1, 0.147058824, 0, 0.017647059, 0.082352941, 0.229411765, 
0.1, 0.064705882), chsChr2SnpFreq = c(0, 0, 0.135, 0.07, 0, 0, 
0.02, 0.085, 0.02, 0.005), lwkChr2SnpFreq = c(0.030927835, 0.005154639, 
0.226804124, 0.113402062, 0, 0.036082474, 0.087628866, 0.056701031, 
0.015463918, 0.036082474), tsiChr2SnpFreq = c(0.035714286, 0.005102041, 
0.239795918, 0.168367347, 0, 0.030612245, 0.035714286, 0.147959184, 
0.091836735, 0.015306122), yriChr2SnpFreq = c(0.045454545, 0.005681818, 
0.170454545, 0.130681818, 0.005681818, 0.028409091, 0.113636364, 
0.090909091, 0.034090909, 0.045454545), ALT = c("A", "A", "A", 
"T", "G", "A", "T", "G", "G", "T"), AA = c("a", "C", "t", "t", 
"C", "G", "t", "G", "c", "a")), .Names = c("ID", "pos", "ceuChr2SnpFreq", 
"chsChr2SnpFreq", "lwkChr2SnpFreq", "tsiChr2SnpFreq", "yriChr2SnpFreq", 
"ALT", "AA"), row.names = c("1", "2", "3", "4", "5", "6", "7", 
"8", "9", "10"), class = "data.frame")

And now a data.table solution:

#use data.table for excellent efficiency
library(data.table)
DT <- data.table(DF)

#subtract 1 from columns 3 to 7 if columns ALT and AA are equal (case insensitive)
DT[tolower(ALT)==tolower(AA), 3:7 := lapply(.SD, `-`, e2 = 1), .SDcols=3:7]

#              ID   pos ceuChr2SnpFreq chsChr2SnpFreq lwkChr2SnpFreq tsiChr2SnpFreq yriChr2SnpFreq ALT AA
#  1: rs187078949 10133   -0.929411765         -1.000   -0.969072165   -0.964285714   -0.954545455   A  a
#  2: rs191522553 10140    0.005882353          0.000    0.005154639    0.005102041    0.005681818   A  C
#  3: rs149483862 10286    0.100000000          0.135    0.226804124    0.239795918    0.170454545   A  t
#  4: rs150919307 10297   -0.852941176         -0.930   -0.886597938   -0.831632653   -0.869318182   T  t
#  5: rs186644623 10315    0.000000000          0.000    0.000000000    0.000000000    0.005681818   G  C
#  6: rs193294418 10345    0.017647059          0.000    0.036082474    0.030612245    0.028409091   A  G
#  7: rs185496709 10386   -0.917647059         -0.980   -0.912371134   -0.964285714   -0.886363636   T  t
#  8: rs188771313 10419   -0.770588235         -0.915   -0.943298969   -0.852040816   -0.909090909   G  G
#  9: rs192945962 10425    0.100000000          0.020    0.015463918    0.091836735    0.034090909   G  c
# 10: rs184397180 10431    0.064705882          0.005    0.036082474    0.015306122    0.045454545   T  a
Roland
  • 127,288
  • 10
  • 191
  • 288
  • I'm going to have to look up := – Thaddeus Aid Aug 31 '13 at 18:51
  • Just read the [data.table intro](http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf) and [FAQ](http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf). – Roland Aug 31 '13 at 18:52
1

for is not your friend in R, here's a solution using apply and conditional indexing

## create some toy data    
matrix( ncol=5, nrow = 100, c( runif(300), sample(c('A','G','C','T','a','c','g','t'), replace=T, 200))) -> data

flip_allele_freqs <- function(x) { 
## function will return 1-x on any x that looks like a number less than 1
    n = as.numeric(x)
    if ( is.na(n) ) { ## cant convert to numeric, must be str
        return(x)
    }
    if (n < 1) {
        return( 1 - n )
    } else {
        return x
    }
}

## apply the flip alleles function to the rows where the two last columns are equal
##fold the new data back into the old matrix

data[toupper(data[,5]) == toupper(data[,4]),] <- 
    apply(data[toupper(data[,5]) == toupper(data[,4]),], c(1,2), flip_allele_freqs)

Good luck with the GWAS!

qwwqwwq
  • 6,999
  • 2
  • 26
  • 49
1

In base R you can do simply

flip <- Vectorize(grepl)(chr2SnpFreqNorm[,8], chr2SnpFreqNorm[,9], ignore.case=TRUE)

chr2SnpFreqNorm[flip,3:7] <- 1 - chr2SnpFreqNorm[filp,3:7]

This may be a bit slow because Vectorize hides a loop. But, if all you need is to flip rows where columns 8 and 9 match exactly (except for case), then use this filter instead:

flip <- tolower(chr2SnpFreqNorm[,8])==tolower(chr2SnpFreqNorm[,9])
Ferdinand.kraft
  • 12,579
  • 10
  • 47
  • 69