Basically a followup on this question.

I'm still trying to get a grasp of vectorisation in R while speeding up a coworker's code. I've read The R Inferno and "Speed up the loop operation in R".

My aim is to speed up the following code. The complete dataset contains ~1,000 columns by 10,000 to 1,000,000 rows:

df3 <- structure(c("X", "X", "X", "X", "O", "O", "O", "O", "O", "O", 
"O", "O", "O", "O", "O", "O"), .Dim = c(2L, 8L), .Dimnames = list(
    c("1", "2"), c("pig_id", "code", "DSFASD32", "SDFSD56", 
    "SDFASD12", "SDFSD56342", "SDFASD12231", "SDFASD45442"
    )))

score_1 <- structure(c(0, 0, 0, 0, 0, 0), .Dim = 2:3)


for (i in 1:nrow(df3)) {
  a<-matrix(table(df3[i,3:ncol(df3)]))

  if (nrow(a)==1) {
    score_1[i,1]<-0    #count number of X (error), N (not compared) and O (ok)
    score_1[i,2]<-a[1,1]
  }
  if (nrow(a)==2) {
    score_1[i,1]<-a[1,1]
    score_1[i,2]<-a[2,1]
  }
  if (nrow(a)==3) {
    score_1[i,1]<-a[1,1]
    score_1[i,2]<-a[2,1]
    score_1[i,3]<-a[3,1]
  }                        
}
colnames(score_1) <- c("N", "O", "X")

I have been trying myself but can't seem to figure it out yet. Here is what I've tried. It shows the same output as the code above, but I'm not sure if it actually does the same. I'm missing that bit of insight in R and my data set.

I can't seem to get my code to produce output in the same format as the for loop.


Edit: In response to Heroka's comment, I've updated my reproducible example:

Output of the for loop:

     [,1] [,2] [,3]
[1,]    0    6    0
[2,]    0    6    0

Output of the apply function:

     1 2
[1,] 6 6
Bas
  • Can you write down in words what you want to do here? – Heroka Dec 03 '15 at 08:41
  • @Heroka, to be honest, I don't know. This is a coworker's code that I want to improve. I tried to make sense of the `matrix(table())` call, but I don't know exactly what it's supposed to do. – Bas Dec 03 '15 at 08:44
  • That makes it quite difficult to find a solution for you; I don't really want to go through all the code trying to figure out what happens. It looks like a kind of rowwise count of "X", "O", and "N" (not in the example data). Is that correct? – Heroka Dec 03 '15 at 08:49
  • Does `t(apply(df3[,-c(1:2)],1,table))` do what you want? – Heroka Dec 03 '15 at 08:52
  • @Heroka Yes, it is a rowwise count of "X", "O" and "N". The apply gets really close to what I need, almost there :) I've updated my question to contain a reproducible example. – Bas Dec 03 '15 at 10:59
  • Your loop errors when `i = 3`. And `t(apply(df3, 1, table))` gives a rowwise count of the letters. – tospig Dec 03 '15 at 11:02
  • @tospig I see, I've fixed the code. The `t(apply(df3[,-c(1:2)], 1, table))` call does count the letters per row, but I need the result in the same format as the output of the for loop || Off topic: your name is an anagram of the company I do my internship for! – Bas Dec 03 '15 at 11:16
  • @Bas I don't think the provided code is necessarily correct. For instance, if there's only one unique character in a row, its count gets written to the second results column, irrespective of which character that is. There might be some assumptions/regularities in the data here, but it would make me very nervous. – Heroka Dec 03 '15 at 14:29
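
Heroka's concern can be reproduced with a small sketch. The row below is hypothetical (it is not in the example data) and contains only "X"; the `if` block is copied from the first branch of the loop in the question.

```r
# Hypothetical row with a single distinct letter, "X" (not in the example data).
row_vals <- c("X", "X", "X", "X", "X", "X")

# Same construction as in the loop: a one-column matrix of counts.
# matrix() drops the names, so the letter each count belongs to is lost.
a <- matrix(table(row_vals))

# One result row, labelled the way score_1 is labelled after the loop.
score <- c(N = 0, O = 0, X = 0)

# First branch of the loop: one distinct letter always goes to column 2.
if (nrow(a) == 1) {
  score[1] <- 0
  score[2] <- a[1, 1]
}

score  # the six X's end up under "O"
```

So the loop is only right for single-letter rows when that letter is always "O", which is the kind of assumption about the data mentioned above.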

1 Answer

Converting each row to a factor with fixed levels gives you the desired table layout, because `table` then reports a count for every level and letters that are absent come out as zero. It is less computationally efficient than just using `apply` and `table`, though.

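res <- t(apply(df3[, -c(1:2)], 1, function(x) {
  # Fixed levels make table() report a count for N, O and X in every row,
  # with 0 for letters that are absent.
  x_f <- factor(x, levels = c("N", "O", "X"))
  table(x_f)
}))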

> res
  N O X
1 0 6 0
2 0 6 0
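
If the per-row `apply` turns out to be too slow on the full data, one possible alternative (not part of the original answer, and untimed here) is to compare the whole matrix against each letter and count matches with `rowSums`. The names `calls`, `lvls` and `res_fast` are made up for this sketch, and it assumes the data is a character matrix with more than one row, as in the example.

```r
# Example data from the question: a 2 x 8 character matrix.
df3 <- matrix(c(rep("X", 4), rep("O", 12)), nrow = 2,
              dimnames = list(c("1", "2"),
                              c("pig_id", "code", "DSFASD32", "SDFSD56",
                                "SDFASD12", "SDFSD56342", "SDFASD12231",
                                "SDFASD45442")))

# Drop the two identifier columns, keeping the matrix shape.
calls <- df3[, -(1:2), drop = FALSE]

# The letters to count; assumed to be the only values of interest.
lvls <- c("N", "O", "X")

# For each letter, count per row how many cells equal it.
# na.rm = TRUE ignores missing cells, as table() does by default.
res_fast <- vapply(lvls,
                   function(l) rowSums(calls == l, na.rm = TRUE),
                   numeric(nrow(calls)))

res_fast
```

This loops over three letters instead of over every row, so the heavy work happens in vectorised comparisons. The result holds the same counts as `res`, stored as doubles rather than integers.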

For a smaller dataset, melting the data first (reshaping it to long format) might be an option, but with 1e6 rows and 100 columns you'd need a lot of memory.
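
A rough base-R analogue of the melt-first idea, as a sketch only: the answer does not say which melting function was meant, so this uses `row()` and a single `table` call instead of a reshaping package. It builds one row index per cell, which is where the memory cost comes from. `calls`, `lvls` and `res_long` are illustrative names.

```r
# Example data from the question, without the two identifier columns:
# every remaining cell is "O".
calls <- matrix("O", nrow = 2, ncol = 6, dimnames = list(c("1", "2"), NULL))

lvls <- c("N", "O", "X")

# Long format: one (row index, letter) pair per cell, then a single
# cross-tabulation. Both vectors are as long as the whole matrix.
res_long <- table(as.vector(row(calls)),
                  factor(as.vector(calls), levels = lvls))

res_long
```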

Heroka