0

I am fairly new with R and I wrote a piece of code to iterate over a data frame (cell by cell) to perform a 'simple' check by if...else statements but it turns out it is very slow.

here the code

for(m in 1:(ncol(df) - 1)){
    listmatch <- unlist(subset(dt, Experiment == m & Number == m)[[which(colnames(dt) == "id")]])
    for(n in 1:nrow(df)){
      if (is.na(df[n, m]) == TRUE){
        df[n, m] <- NA
      } else {
        if (sum(as.numeric(unlist(strsplit(as.character(df[n, 33]), split = ";"))) %in% listmatch, na.rm = TRUE) > 0){
          df[n, m] <- 1
        } else {
          df[n, m] <- NA
      }}
    next}
  next}

Just to give some dimensions, df is 80000 obs of 33 variable, listmatch is on average a vector of 5000 values, while the column 33 of df contains information such as

1;2;3;4;5;6;7;8;9;10;11;12;13...

on average 150 values separated by semicolumn. I do not understand why it is very slow, neither how to optimize it and make it faster. Is it more the nested for loops or the nested if...else statements that make it slow? Any suggestions?

Here an example of my data df

        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]     [,33]
 [1,]    NA    NA    NA    NA    NA     1    NA     1;8;9;6
 [2,]     1    NA    NA    NA    NA    NA    NA           5
 [3,]     1    NA    NA     1    NA    NA    NA           6
 [4,]     1    NA     1    NA     1     1     1   88;98;125
 [5,]    NA    NA    NA    NA    NA    NA    NA         2;3
 [6,]     1    NA    NA    NA    NA    NA    NA         4;5
 [7,]    NA    NA    NA    NA    NA    NA     1          44
 [8,]    NA    NA     1     1     1    NA    NA          46
 [9,]    NA     1     1    NA    NA    NA    NA          78
[10,]    11    NA    NA    NA    NA    NA    NA     147;411
[11,]    NA     1    NA    NA    NA    NA    NA         400
[12,]    NA     1     1    NA    NA    NA    NA         658
[13,]    NA     1    NA     1    NA    NA    NA 87;988;1226
[14,]    NA    NA    NA    NA     1     1    NA 121;232;321
[15,]    NA    NA     1    NA    NA     1    NA           5

while dt looks like this

           Text Experiment Number    id
 [1,]        df          5      3  1
 [2,]        fd          8      8  8
 [3,]       reh          7      4  6
 [4,]   egfdgsd          4      8  8
 [5,]  fbdsgryj          5      3  44
 [6,] rtryyukfn          7      2  411
 [7,]    fgsfde          5      6  400
 [8,] wthrthmgh          2      7  1274
 [9,]      vdfd          2      8  658
[10,]       vfr          8      6  147
[11,]     brtey          3      2  1226
[12,]     tyirt          7      5  988
[13,]       fyi          5      4  87
[14,]     ywrtj          7      5  46
[15,]      kyou          8      6  4
[16,]      pkjh          1      7  8
[17,]    retuyk          5      4  9
[18,]        re          7      8  11
[19,]     eryre          6      3  12
[20,]      fdhd          6      1  18

I forgot to mention that I would like to keep the code

listmatch <- unlist(subset(dt, Experiment == m & Number == m)[[which(colnames(dt) == "id")]])

as it is, at the moment I am filtering based on the value of m but I would like in the future to apply manually different filtering, i.e. ...Experiment == m*2 & Number == m*5)...

pisistrato
  • 375
  • 2
  • 14
  • 1
    can you give small data example and what are you trying to accomplish? because it is highly possible that you do not need to use loops... – minem Aug 29 '17 at 13:14
  • 1
    what is `dt`? You should provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – M-- Aug 29 '17 at 13:19
  • I have added an example of my data. What I would like to do is to check in `df` all the cells with 1 and see if at least one of the numbers in the 33th column is present in the column "id" of `dt` for the relative Experiment and Number, where Experiment and Number are the column number `m` of `df` – pisistrato Aug 29 '17 at 13:34
  • Shouldn't you be checking column 32 on `df`? Your example shows numbers collapsed with `;` on column 32 rather than 33. – Gabriel Mota Aug 29 '17 at 13:59
  • Sorry, made a typo in the example, it is column 33 the one with the collapsed values – pisistrato Aug 29 '17 at 14:16

0 Answers0