
I've got two data frames I want to compare: if a specific location in both data frames meets a requirement, assign "X" to that location in a separate data frame.

How can I get the expected output efficiently? The real data frame contains 1000 columns with thousands to millions of rows. I think data.table would be the quickest option, but I don't have a grasp of how data.table works yet.

Expected output:

> print(result)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "O" 
# [2,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "O" 
# [3,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "X" 

My code:

df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 
            2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L), .Dimnames = list(
              c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 
            2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L), .Dimnames = list(
              c("A", "B", "C"), NULL))

result <- matrix("O", nrow(df1), ncol(df1))


for (i in 1:nrow(df1)) 
{
  for (j in 3:ncol(df1)) 
  {
    result[i,1] = c("A")
    result[i,2] = c("A")
    if (is.na(df1[i,j]) || is.na(df2[i,j])){
      result[i,j] <- c("N")
    }
    if (!is.na(df1[i,j]) && !is.na(df2[i,j]))
    {

      if (df1[i,j] %in% c("0","1","2") & df2[i,j] %in% c("0","1","2")) {
        result[i,j] <- c("X") 
      }
    }
  }
}   


print(result)

Edit

I like both @David's and @Heroka's solutions. On a small dataset, Heroka's solution is about 125 times as fast as the original, and David's is about 29 times as fast. Here's the benchmark:

> mbm
Unit: milliseconds
             expr        min          lq       mean      median          uq        max neval
         original 1058.81826 1110.481659 1131.81711 1112.848211 1124.775989 1428.18079   100
           Heroka    8.46317    8.711986    9.03517    8.914616    9.067793   18.06716   100
 DavidAarenburg()   35.58350   36.660565   39.85823   37.061160   38.175700   53.83976   100
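
For anyone wanting to rerun a comparison like this, here is a rough sketch using base R's `system.time` on simulated data. The function names (`loop_version`, `ifelse_version`, `index_version`), the matrix size, and the injected `NA`s are assumptions made for illustration. `loop_version` is a condensed variant of the loop in the question, not the exact code that was timed, and the table above looks like `microbenchmark` output, which this sketch does not use.

```r
set.seed(1)
n_row <- 200; n_col <- 50   # assumed size, small enough to run quickly
m1 <- matrix(sample(0:4, n_row * n_col, replace = TRUE), n_row, n_col)
m2 <- matrix(sample(0:4, n_row * n_col, replace = TRUE), n_row, n_col)
m1[sample(length(m1), 20)] <- NA   # assumption: a few NAs to exercise "N"

# Condensed variant of the nested loop from the question.
loop_version <- function(a, b) {
  res <- matrix("O", nrow(a), ncol(a))
  res[, 1:2] <- "A"
  for (i in seq_len(nrow(a))) {
    for (j in 3:ncol(a)) {
      if (is.na(a[i, j]) || is.na(b[i, j])) {
        res[i, j] <- "N"
      } else if (a[i, j] %in% 0:2 && b[i, j] %in% 0:2) {
        res[i, j] <- "X"
      }
    }
  }
  res
}

# Nested ifelse(), as in the first suggestion of the answer.
ifelse_version <- function(a, b) {
  res <- ifelse(is.na(a) | is.na(b), "N",
                ifelse(a %in% 0:2 & b %in% 0:2, "X", "O"))
  res[, 1:2] <- "A"
  res
}

# Logical-matrix indexing, as suggested in the comments.
index_version <- function(a, b) {
  res <- matrix("O", nrow(a), ncol(a))
  res[is.na(a) | is.na(b)] <- "N"
  res[a < 3 & b < 3] <- "X"
  res[, 1:2] <- "A"
  res
}

# All three should agree before their timings are compared.
stopifnot(identical(loop_version(m1, m2), ifelse_version(m1, m2)),
          identical(loop_version(m1, m2), index_version(m1, m2)))

# Elapsed seconds for one run of each version.
timings <- sapply(
  list(loop = loop_version, ifelse = ifelse_version, index = index_version),
  function(f) system.time(f(m1, m2))[["elapsed"]]
)
timings
```

A single `system.time` run is noisy, so repeated runs (or a dedicated benchmarking package) would give more trustworthy numbers.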

Thanks a lot, guys!

Bas
  • Those are not data.frames or data.tables, so I'm removing the tags. You're looking at matrices. These are distinct "classes" for objects in R. – Frank Nov 30 '15 at 14:54
  • I have edited the title to reflect this. – Heroka Nov 30 '15 at 15:05
  • Isn't this just `result[df1 < 3 & df2 < 3] <- "X" ; result[, 1:2] <- "A" ; result[is.na(df1) | is.na(df2)] <- "N"`? – David Arenburg Nov 30 '15 at 15:42

1 Answer


You have matrices, not dataframes.

One approach might be to use ifelse (and to use %in% with a numeric vector rather than strings, which saves about 50% of the time by avoiding the type conversion):

  result <- ifelse(is.na(df1)|is.na(df2),"N",
                   ifelse(df1 %in% 0:2 & df2 %in% 0:2,"X","O"))
  result[,1:2] <- "A"
  result
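
One detail worth checking: `%in%` returns a plain logical vector and drops the matrix dimensions, so the result only comes back as a matrix because `ifelse` takes its shape from the `is.na(...)` test. A small sketch, reusing the question's values under the hypothetical names `m1` and `m2` and introducing one `NA` purely for illustration:

```r
# Question's data, rebuilt with matrix() (column-major fill, same values).
m1 <- matrix(c(1,1,1, 2,2,2, 3,3,3, 1,1,1, 1,1,1, 2,2,2, 2,2,2, 3,3,3, 2,0,1),
             nrow = 3, dimnames = list(c("A", "B", "C"), NULL))
m2 <- matrix(c(1,1,1, 2,2,2, 3,3,3, 1,1,1, 1,1,1, 2,2,2, 2,2,2, 1,3,3, 4,4,2),
             nrow = 3, dimnames = list(c("A", "B", "C"), NULL))
m1[2, 5] <- NA   # assumption: an NA added only to exercise the "N" branch

in_range <- m1 %in% 0:2 & m2 %in% 0:2
dim(in_range)    # NULL: %in% dropped the dimensions

res <- ifelse(is.na(m1) | is.na(m2), "N", ifelse(in_range, "X", "O"))
res[, 1:2] <- "A"
dim(res)         # 3 9: the shape comes from the is.na() test
res
```

Because `NA %in% 0:2` is `FALSE` rather than `NA`, the inner `ifelse` never produces `NA`, and the outer test alone decides where "N" goes.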

With thanks to @DavidArenburg, a further improvement in speed:

result <- matrix("O",nrow=nrow(df1),ncol=ncol(df1))
result[is.na(df1) | is.na(df2)] <- "N"
result[df1 < 3 & df2 < 3] <- "X"
result[, 1:2] <- "A"
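
Note that `df1 < 3` is only equivalent to `df1 %in% 0:2` when the values are non-negative whole numbers, as they are in the question's data. A small sketch with hypothetical values (not from the question) shows where the two tests differ, and that an `NA` in a logical index is simply skipped when the assigned value has length one:

```r
x <- c(-1, 0, 1.5, 2, 3, NA)   # hypothetical values, not from the question

x < 3         # TRUE  TRUE  TRUE  TRUE FALSE    NA
x %in% 0:2    # FALSE TRUE FALSE  TRUE FALSE FALSE

out <- rep("O", length(x))
out[is.na(x)] <- "N"
out[x < 3] <- "X"   # the NA position is not selected, so "N" survives
out           # "X" "X" "X" "X" "O" "N"
```

If negative or fractional values can occur in the real data, the `%in% 0:2` test may be the safer of the two.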
Heroka
  • Works like a charm @Heroka! I'll wait with accepting your answer until later today in case answers with higher performance still show up! – Bas Nov 30 '15 at 15:18
  • Did you test this on a big data set or on OP's mock data? – David Arenburg Nov 30 '15 at 17:04
  • OP's mock data. My solution seems to get relatively slower (order ~10) when df1 is 1000 cols/10000 rows. – Heroka Nov 30 '15 at 17:06
  • I don't have access to my computer at the moment! Will try it tomorrow! – Bas Nov 30 '15 at 20:33
  • @David and Heroka, I've added a benchmark for a small dataset to the OP. I'll get hold of the complete dataset soon to update the benchmark. – Bas Dec 01 '15 at 08:11
  • @David, Heroka, I've made a followup on this question, I know I'm close but I can't solve it quite yet. Maybe you guys can help!:) http://stackoverflow.com/questions/34018634/r-speeding-up-and-removing-if-statements-from-for-loop?lq=1 – Bas Dec 03 '15 at 07:26