I have two data frames that I want to compare: if a given location meets a requirement in both data frames, assign "X" to that location in a separate data frame.
How can I get the expected output in an efficient way? The real data frame contains 1000 columns with thousands to millions of rows. I think data.table would be the quickest option, but I don't have a grasp of how data.table works yet.
Expected output:
> print(result)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "A" "A" "O" "X" "X" "X" "X" "O" "O"
# [2,] "A" "A" "O" "X" "X" "X" "X" "O" "O"
# [3,] "A" "A" "O" "X" "X" "X" "X" "O" "X"
My code:
df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L), .Dimnames = list(
c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L), .Dimnames = list(
c("A", "B", "C"), NULL))
result <- matrix("O", nrow(df1), ncol(df1))
for (i in 1:nrow(df1)) {
  # The first two columns are always labelled "A".
  result[i, 1] <- "A"
  result[i, 2] <- "A"
  for (j in 3:ncol(df1)) {
    if (is.na(df1[i, j]) || is.na(df2[i, j])) {
      # A missing value in either input gives "N".
      result[i, j] <- "N"
    } else if (df1[i, j] %in% c(0, 1, 2) && df2[i, j] %in% c(0, 1, 2)) {
      # Both values are 0, 1 or 2.
      result[i, j] <- "X"
    }
  }
}
print(result)
Edit
I like both @David's and @Heroka's solutions. On a small dataset, Heroka's solution is about 125 times as fast as the original, and David's is about 29 times as fast. Here's the benchmark:
> mbm
Unit: milliseconds
             expr        min          lq       mean      median          uq        max neval
         original 1058.81826 1110.481659 1131.81711 1112.848211 1124.775989 1428.18079   100
           Heroka    8.46317    8.711986    9.03517    8.914616    9.067793   18.06716   100
 DavidAarenburg()   35.58350   36.660565   39.85823   37.061160   38.175700   53.83976   100
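For anyone landing here later, the general idea of replacing my loop with whole-matrix operations can be sketched roughly as below. This is my own minimal sketch and not necessarily the code from either answer. The function name `compare_vectorized` and its `ok` argument are made up for illustration, and it assumes both inputs are numeric matrices of the same shape with at least three columns.

```r
# Sample inputs copied from the question (3 x 9 numeric matrices).
df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))

# Hypothetical helper (name and arguments are assumptions for illustration).
compare_vectorized <- function(m1, m2, ok = c(0, 1, 2)) {
  stopifnot(identical(dim(m1), dim(m2)), ncol(m1) >= 3)
  result <- matrix("O", nrow(m1), ncol(m1))
  # Cells where either input is missing.
  na_mask <- is.na(m1) | is.na(m2)
  # Cells where both inputs are present and both are in the allowed set.
  x_mask <- !na_mask & (m1 %in% ok) & (m2 %in% ok)
  result[na_mask] <- "N"
  result[x_mask] <- "X"
  # The first two columns are always "A", as in the loop version.
  result[, 1:2] <- "A"
  result
}

result <- compare_vectorized(df1, df2)
print(result)
```

The logical masks are computed once over the whole matrix, so R does the element-wise work internally instead of stepping through every cell in an interpreted loop.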
Thanks a lot, guys!