I'm looking for a more efficient way to do some replacements/lookups.
My current method is using paste0
to create a lookup value, and then matching on that to filter.
Given x
,
x <- data.frame(var1 = c("AA","BB","CC","DD"),
var2 = c("--","AA","AA","--"),
val1 = c(1,2,1,4),
val2 = c(5,5,7,8))
var1 var2 val1 val2
1 AA -- 1 5
2 BB AA 2 5
3 CC AA 1 7
4 DD -- 4 8
var1
is the primary column and var2
is the secondary column. val1
and val2
are value columns.
If var2
is a value in var1
and the values match, we want to replace the stated val
with NA - and we want to do this independently for the value columns.
The way that I've come up with uses a lookup that loops over the columns and creates a lookup value essentially.
lookup.df <- x %>% filter(var2 == "--")
x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
var2.lookup <- paste0(x$var2,x[[column]])
var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])
x[[column]][var2.lookup %in% var1.lookup] <- NA
return(x[[column]])
})
which does return what I would expect.
> x
var1 var2 val1 val2
1 AA -- 1 5
2 BB AA 2 NA
3 CC AA NA 7
4 DD -- 4 8
However, in practice, when profiling the code, the majority of the time is spent in the paste - and it just doesn't feel like the most efficient way to do this.
My real data set is millions of rows and about 25 columns, and runs in around 60 seconds. I'd think there'd be a way to do a logical matrix replacement instead of accessing each column individually. I can't figure it out though.
Any help is greatly appreciated. Thanks!
Edit -- benchmarks
na.replace.orig <- function(x) {
lookup.df <- x %>% filter(var2 == "--")
x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
var2.lookup <- paste0(x$var2,x[[column]])
var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])
x[[column]][var2.lookup %in% var1.lookup] <- NA
return(x[[column]])
})
return(x)
}
# pulled out the lookup table since it causes a lot of overhead
na.replace.orig.no.lookup <- function(x) {
x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
var2.lookup <- paste0(x$var2,x[[column]])
var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])
x[[column]][var2.lookup %in% var1.lookup] <- NA
return(x[[column]])
})
return(x)
}
na.replace.1 <- function(x) {
inx <- match(x$var2, x$var1)
jnx <- which(!is.na(inx))
inx <- inx[!is.na(inx)]
knx <- grep("^val", names(x))
for(i in seq_along(inx))
for(k in knx)
if(x[[k]][inx[i]] == x[[k]][jnx[i]]) x[[k]][jnx[i]] <- NA
return(x)
}
na.replace.2 <- function(x) {
for(col in c("val1","val2")) {
x[x[,'var2'] %in% x[,'var1'] & x[,col] %in% lookup.df[,col] , col] <- NA
}
return(x)
}
> microbenchmark::microbenchmark(na.replace.orig(x), na.replace.orig.no.lookup(x), na.replace.1(x), na.replace.2(x), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
na.replace.orig(x) 1267.23 1274.2 1441.9 1408.8 1609.8 1762.8 10
na.replace.orig.no.lookup(x) 217.43 228.9 270.9 239.2 296.6 394.2 10
na.replace.1(x) 98.46 106.3 133.0 123.9 136.6 239.2 10
na.replace.2(x) 117.74 147.7 162.9 166.6 183.0 189.9 10
Edit - 3rd Variable Required
I realized that I have a 3rd variable I need to check against.
x <- data.frame(var1 = c("AA","BB","CC","DD"),
var2 = c("--","AA","AA","--"),
var3 = c("Y","Y","N","N"),
val1 = c(1,2,1,4),
val2 = c(5,5,7,8))
var1 var2 var3 val1 val2
1 AA -- Y 1 5
2 BB AA Y 2 5
3 CC AA N 1 7
4 DD -- N 4 8
with the expected result
var1 var2 var3 val1 val2
1 AA -- Y 1 5
2 BB AA Y 2 NA
3 CC AA N 1 7
4 DD -- N 4 8
My code still works for this case.
x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
var2.lookup <- paste0(x$var2, x$var3, x[[column]])
var1.lookup <- paste0(lookup.df$var1, x$var3, lookup.df[[column]])
x[[column]][var2.lookup %in% var1.lookup] <- NA
return(x[[column]])
})