2

I'm looking for a more efficient way to do some replacements/lookups.

My current method is using paste0 to create a lookup value, and then matching on that to filter.

Given x,

x <- data.frame(var1 = c("AA","BB","CC","DD"), 
                var2 = c("--","AA","AA","--"), 
                val1 = c(1,2,1,4), 
                val2 = c(5,5,7,8))

  var1 var2 val1 val2
1   AA   --    1    5
2   BB   AA    2    5
3   CC   AA    1    7
4   DD   --    4    8

var1 is the primary column and var2 is the secondary column. val1 and val2 are value columns.

If var2 is a value in var1 and the values match, we want to replace the stated val with NA - and we want to do this independently for the value columns.

The way that I've come up with uses a lookup that loops over the columns and creates a lookup value essentially.

lookup.df <- x %>% filter(var2 == "--")

x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
  var2.lookup <- paste0(x$var2,x[[column]])
  var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])

  x[[column]][var2.lookup %in% var1.lookup] <- NA

  return(x[[column]])
})

which does return what I would expect.

> x
  var1 var2 val1 val2
1   AA   --    1    5
2   BB   AA    2   NA
3   CC   AA   NA    7
4   DD   --    4    8

However, in practice, when profiling the code, the majority of the time is spent in the paste - and it just doesn't feel like the most efficient way to do this.

My real data set is millions of rows and about 25 columns, and runs in around 60 seconds. I'd think there'd be a way to do a logical matrix replacement instead of accessing each column individually. I can't figure it out though.

Any help is greatly appreciated. Thanks!

Edit -- benchmarks

na.replace.orig <- function(x) {
  lookup.df <- x %>% filter(var2 == "--")

  x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
    var2.lookup <- paste0(x$var2,x[[column]])
    var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])

    x[[column]][var2.lookup %in% var1.lookup] <- NA

    return(x[[column]])
  })

  return(x)
}

# pulled out the lookup table since it causes a lot of overhead
na.replace.orig.no.lookup <- function(x) {

  x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
    var2.lookup <- paste0(x$var2,x[[column]])
    var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])

    x[[column]][var2.lookup %in% var1.lookup] <- NA

    return(x[[column]])
  })

  return(x)
}

na.replace.1 <- function(x) {
  inx <- match(x$var2, x$var1)
  jnx <- which(!is.na(inx))
  inx <- inx[!is.na(inx)]
  knx <- grep("^val", names(x))

  for(i in seq_along(inx))
    for(k in knx)
      if(x[[k]][inx[i]] == x[[k]][jnx[i]]) x[[k]][jnx[i]] <- NA

  return(x)
}

na.replace.2 <- function(x) {

  for(col in c("val1","val2")) {
    x[x[,'var2'] %in% x[,'var1'] & x[,col] %in% lookup.df[,col] , col] <- NA
  }

  return(x)
}

> microbenchmark::microbenchmark(na.replace.orig(x), na.replace.orig.no.lookup(x), na.replace.1(x), na.replace.2(x), times = 10)
Unit: microseconds
                         expr     min     lq   mean median     uq    max neval
           na.replace.orig(x) 1267.23 1274.2 1441.9 1408.8 1609.8 1762.8    10
 na.replace.orig.no.lookup(x)  217.43  228.9  270.9  239.2  296.6  394.2    10
              na.replace.1(x)   98.46  106.3  133.0  123.9  136.6  239.2    10
              na.replace.2(x)  117.74  147.7  162.9  166.6  183.0  189.9    10

Edit - 3rd Variable Required

I realized that I have a 3rd variable I need to check against.

x <- data.frame(var1 = c("AA","BB","CC","DD"), 
                var2 = c("--","AA","AA","--"),
                var3 = c("Y","Y","N","N"),
                val1 = c(1,2,1,4), 
                val2 = c(5,5,7,8))

  var1 var2 var3 val1 val2
1   AA   --    Y    1    5
2   BB   AA    Y    2    5
3   CC   AA    N    1    7
4   DD   --    N    4    8

with the expected result

  var1 var2 var3 val1 val2
1   AA   --    Y    1    5
2   BB   AA    Y    2   NA
3   CC   AA    N    1    7
4   DD   --    N    4    8

My code still works for this case.

x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
  var2.lookup <- paste0(x$var2, x$var3, x[[column]])
  var1.lookup <- paste0(lookup.df$var1, x$var3, lookup.df[[column]])

  x[[column]][var2.lookup %in% var1.lookup] <- NA

  return(x[[column]])
})
Shuo
  • 491
  • 1
  • 6
  • 16

2 Answers2

3

The following solution uses only vectorized logic. It uses the lookup table you already made. I think it'll be even faster than Rui's solution

library(dplyr)
x <- data.frame(var1 = c("AA","BB","CC","DD"), 
                var2 = c("--","AA","AA","--"), 
                val1 = c(1,2,1,4), 
                val2 = c(5,5,7,8))

lookup.df <- x[ x[,'var2'] == "--", ]

x[x[,'var2'] %in% x[,'var1'] & x[,'val1'] %in% lookup.df[,'val1'] , 'val1'] <- NA
x[x[,'var2'] %in% x[,'var1'] & x[,'val2'] %in% lookup.df[,'val2'] , 'val2'] <- NA

x
#>   var1 var2 val1 val2
#> 1   AA   --    1    5
#> 2   BB   AA    2   NA
#> 3   CC   AA   NA    7
#> 4   DD   --    4    8

EDIT:

It might be or it might not be.

set.seed(4)
microbenchmark::microbenchmark(na.replace.orig(x), na.replace.1(x), na.replace.2(x), times = 50)
#> Unit: microseconds
#>                expr     min      lq     mean   median      uq      max
#>  na.replace.orig(x) 184.348 192.410 348.4430 202.1615 223.375 6206.546
#>     na.replace.1(x)  68.127  86.621 281.3503  89.8715  93.381 9693.029
#>     na.replace.2(x)  95.885 105.858 210.7638 113.2060 118.668 4993.849
#>  neval
#>     50
#>     50
#>     50

OP, you'll need to test it on your dataset to see how the two scale differently at larger-sized dataframes.

Edit 2: Implemented Rui's suggestion for the lookup table. In order from slowest to fastest benchmark:

lookup.df <- x %>% filter(var2 == "--")
lookup.df <- filter(x, var2 == "--")
lookup.df <- x[x[,'var2'] == "--", ]
  • 1
    Upvote but a note: pipes are slower than indexing, try `lookup.df <- x[x$var2 == "--", ]`. – Rui Barradas Aug 29 '18 at 05:23
  • Thanks for the tip. Apparently, filter vs. subset is debated, but the pipe undoubtedly slows down the operation https://stackoverflow.com/questions/39882463/difference-between-subset-and-filter-from-dplyr https://stackoverflow.com/questions/35933272/why-is-using-dplyr-pipe-slower-than-an-equivalent-non-pipe-expression-for/35935105 – AcademicDialysis Aug 29 '18 at 16:31
1

I find the following solution a bit confusing (and I came up with it!) but it works.
And contrary to the popular belief, for loops are not much slower than the *apply family.

inx <- match(x$var2, x$var1)
jnx <- which(!is.na(inx))
inx <- inx[!is.na(inx)]
knx <- grep("^val", names(x))

for(i in seq_along(inx))
    for(k in knx)
        if(x[[k]][inx[i]] == x[[k]][jnx[i]]) x[[k]][jnx[i]] <- NA

x
#  var1 var2 val1 val2
#1   AA   --    1    5
#2   BB   AA    2   NA
#3   CC   AA   NA    7
#4   DD   --    4    8
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66