
I have a problem in R. Whenever a row meets a condition in column A (here, `a == 4`), I want to take that row's value in column C, find the last row at or before it where that value appears in column B, and add a number to column D of that matching row. I found a solution, but it is very slow on a data frame with a few million rows: even the parallel version of my original code takes ~30 minutes to complete. How can I speed up this code, or is there a faster alternative that accomplishes the same thing? Here is the parallel code I currently have:

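library(parallel)  # cl is assumed to be a cluster created earlier, e.g. with makeCluster()

x = which(df$a == 4)   # rows where the condition on column a is met
y = df$c[x]            # the column c values of those rows
clusterExport(cl, c("df", "x", "y"))  # copy the data to every worker
# for each trigger row, the last row at or before it whose b matches y[i]
# (grep does regex matching; max() of no matches returns -Inf with a warning)
z = parSapply(cl, seq_along(y), function(i) max(grep(y[i], df$b[1:x[i]])))
hits = z[!is.infinite(z)]   # drop trigger rows that had no match
df$d[hits] = df$d[hits] + 3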
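For comparison, the lookup described above can be sketched in base R without `grep` or a cluster. This is only a rough sketch under one loud assumption: that exact equality between the column C value and column B is what is wanted, whereas `grep` does regular-expression (substring) matching. The helper name `last_match_row` and the tiny data frame are made up for illustration. The idea is to group the row positions of each distinct `b` value once, then use `findInterval` to pick the last position at or before each trigger row.

```r
# Hypothetical helper (not part of the original post): for every row where
# a == trigger, return the last row at or before it whose b equals that row's c.
# ASSUMPTION: exact equality is intended; grep() in the original matches regexes.
last_match_row <- function(a, b, c, trigger = 4) {
  x <- which(a == trigger)                              # trigger rows
  y <- as.character(c[x])                               # values to look up
  pos_by_value <- split(seq_along(b), as.character(b))  # ascending rows per b value
  idx_by_value <- split(seq_along(x), y)                # trigger indices per value
  z <- rep(NA_integer_, length(x))
  for (v in names(idx_by_value)) {
    pos <- pos_by_value[[v]]
    if (is.null(pos)) next                # value never occurs in b
    sel <- idx_by_value[[v]]
    k <- findInterval(x[sel], pos)        # number of positions <= trigger row
    hit <- k > 0                          # FALSE when no earlier occurrence exists
    z[sel[hit]] <- pos[k[hit]]
  }
  z
}

# Small made-up data frame, for illustration only
df <- data.frame(
  a = c(1, 4, 2, 4, 4),
  b = c("p", "q", "p", "r", "q"),
  c = c("q", "p", "r", "p", "zz"),
  d = 0,
  stringsAsFactors = FALSE
)
z <- last_match_row(df$a, df$b, df$c)
hits <- unique(z[!is.na(z)])              # rows to update, each at most once
df$d[hits] <- df$d[hits] + 3
```

The loop runs once per distinct lookup value rather than once per trigger row, so how much it helps depends on how many distinct values column C has among the trigger rows. As in the original code, a row that is matched by several trigger rows still only gets 3 added once.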
grig109
  • [Sample data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), please. (Also helpful is [StackOverflow help/mcve](http://stackoverflow.com/help/mcve).) – r2evans Apr 12 '17 at 23:43
  • With your parallelization, how much of the "too long" is spent in actual calculation, and how much of the time was spent copying the data to/from the parallel R instances? – r2evans Apr 12 '17 at 23:45
  • @r2evans I'm really not sure. How can I figure that out? – grig109 Apr 12 '17 at 23:50
  • One hack: include just enough data to give each node on your cluster `cl` exactly one dataset. Wrap `system.time(...)` around your inner function (capturing its results instead of the actual calculation ... acceptable for the moment), and wrap `system.time(...)` around the whole `parSapply` call. If data transfer is significant, you will see a decent difference in time between the whole (outer) time and the *max* of the inner times. (Admittedly I'm not a guru with `parSapply`, but conceptually I know that data copying can be a problem with parallel setups like these.) – r2evans Apr 12 '17 at 23:54
  • Could you add a small example of the data.frame? – spadarian Apr 13 '17 at 00:05
  • How many rows and columns do you have? – Uwe Sep 16 '17 at 15:07
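The timing hack suggested by r2evans in the comments above might look roughly like the sketch below. It is an illustration only: the two-worker cluster, the object name `big`, and the toy inner computation are all assumptions, not the asker's data or function. It also times `clusterExport` separately, since in the posted code that is where the data frame is copied to the workers.

```r
library(parallel)

cl <- makeCluster(2)                      # assumed: a small local cluster
big <- data.frame(b = sample(letters, 1e5, replace = TRUE),
                  stringsAsFactors = FALSE)  # made-up stand-in data

# Time only the copy of the data to the workers
export_time <- system.time(
  clusterExport(cl, "big", envir = environment())
)[["elapsed"]]

# Outer time: the whole parSapply call.
# Inner times: elapsed time of the (toy) calculation measured on each worker.
outer_time <- system.time(
  inner_times <- parSapply(cl, 1:2, function(i) {
    system.time(sum(big$b == "a"))[["elapsed"]]
  })
)[["elapsed"]]

stopCluster(cl)

# A large export_time, or a large gap between outer_time and max(inner_times),
# suggests that moving data around costs more than the calculation itself.
print(c(export = export_time, outer = outer_time, max_inner = max(inner_times)))
```

If the export or the outer/inner gap dominates, adding more workers is unlikely to help much, and reducing what gets copied (for example exporting only the `b` column) would be the first thing to try.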

0 Answers