Optimizing R sapply function that is run on large data frame

Question

I have a problem in R. When a row meets a condition in column A, I want to take that row's value in column C, find the last row at or before it where that value appears in column B, and add a number to column D in the row that was found. I have a working solution, but it is very slow on a data frame with a few million rows: even the parallel version of my original code takes ~30 minutes to complete. How can I speed up this code, or is there a faster alternative function that accomplishes the same thing? Here is the parallel code I currently have:

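# Rows where column a meets the condition, and the values to look up from column c
x = which(df$a == 4)
y = df$c[x]
# Copy the data frame and both vectors to every worker in the cluster `cl`
clusterExport(cl, c("df", "x", "y"))
# For each flagged row, find the last row at or before it whose b matches y[i]
# (max() of an empty grep() result is -Inf, which is filtered out below)
z = parSapply(cl, seq_along(y), function(i) max(grep(y[i], df$b[1:x[i]])))
df$d[z[!is.infinite(z)]] = df$d[z[!is.infinite(z)]] + 3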
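
One possible direction, sketched under assumptions rather than tested on the real data: if the values in column c are meant to equal the values in column b exactly (the `grep` call above treats `y[i]` as a regular expression instead, so results could differ), the repeated scans of `df$b` might be replaced by indexing the positions of each value once and looking up the last position with `findInterval`. The toy data frame and the helper name `last_match_before` are hypothetical and only illustrate the idea.

```r
# Hypothetical toy data standing in for the real data frame
df <- data.frame(
  a = c(1, 4, 2, 4, 4),
  b = c("u", "v", "u", "w", "v"),
  c = c("x", "u", "y", "v", "zz"),
  d = c(0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# For each (value, limit) pair, return the last index j <= limit with b[j] == value,
# or NA if there is none. Assumes exact equality, not regex matching.
last_match_before <- function(b, values, limits) {
  # Row positions of every distinct value in b, each in increasing order
  pos_by_value <- split(seq_along(b), b)
  vapply(seq_along(values), function(i) {
    pos <- pos_by_value[[as.character(values[i])]]
    if (is.null(pos)) return(NA_integer_)
    # Number of positions that are <= the limit; 0 means no match yet
    k <- findInterval(limits[i], pos)
    if (k == 0L) NA_integer_ else pos[k]
  }, integer(1))
}

x <- which(df$a == 4)
y <- df$c[x]
z <- last_match_before(df$b, y, x)

# Same update as the original code, with NA instead of -Inf marking "no match"
hits <- z[!is.na(z)]
df$d[hits] <- df$d[hits] + 3
```

As in the original assignment, a row that is found more than once still receives the increment only once, because indexed assignment does not accumulate over repeated indices.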

[Sample data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), please. (Also helpful is [StackOverflow help/mcve](http://stackoverflow.com/help/mcve).) — r2evans, Apr 12 '17 at 23:43
With your parallelization, how much of the "too long" is spent in actual calculation, and how much of the time was spent copying the data to/from the parallel R instances? — r2evans, Apr 12 '17 at 23:45
One hack: include just enough data to give each node on your cluster `cl` exactly one dataset. Wrap `system.time(...)` around your inner function (capturing its results instead of the actual calculation ... acceptable for the moment), and wrap `system.time(...)` around the whole `parSapply` call. If data transfer is significant, you will see a decent difference in time between the whole (outer) time and the *max* of the inner times. (Admittedly I'm not a guru with `parSapply`, but conceptually I know that data copying can be a problem with parallel setups like these.) — r2evans, Apr 12 '17 at 23:54
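
The timing check suggested in the last comment could look roughly like the sketch below. It is only an illustration: the toy data frame, the two-worker cluster, and the variable names `export_time`, `inner_times`, and `outer_time` are assumptions, not part of the original setup.

```r
library(parallel)

# Hypothetical toy data; the real data frame has a few million rows
set.seed(1)
n <- 1000
df <- data.frame(
  a = sample(1:5, n, replace = TRUE),
  b = sample(letters, n, replace = TRUE),
  c = sample(letters, n, replace = TRUE),
  d = 0,
  stringsAsFactors = FALSE
)
x <- which(df$a == 4)
y <- df$c[x]

cl <- makeCluster(2)

# Time the copy of the data to the workers separately
export_time <- system.time(
  clusterExport(cl, c("df", "x", "y"))
)[["elapsed"]]

# Give each worker exactly one task
idx <- head(seq_along(y), length(cl))

# Inner timing: each task returns its own elapsed time instead of its result.
# Outer timing: the whole parSapply call, including communication overhead.
outer_time <- system.time(
  inner_times <- parSapply(cl, idx, function(i) {
    system.time(
      suppressWarnings(max(grep(y[i], df$b[1:x[i]])))
    )[["elapsed"]]
  })
)[["elapsed"]]

stopCluster(cl)

# A large gap between these two numbers points at overhead rather than computation
print(c(export = export_time, outer = outer_time, max_inner = max(inner_times)))
```

If `outer_time` is much larger than `max(inner_times)`, or `export_time` dominates, most of the wall-clock time is probably going into moving data rather than into the `grep` calls.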

