
I have both an R script and a Pentaho (PDI) ETL transformation for loading data from a SQL database and performing a calculation. The initial data set has 1.28 million rows of 21 variables and is equivalent in both R and PDI. In fact, I originally wrote the R code and then subsequently "ported" to a transformation in PDI.

The PDI transformation runs in 30s (and includes an additional step of writing the output to a separate DB table). The R script takes between 45m and one hour total. I realize that R is a scripting language and thus interpreted, but it seems like I'm missing some optimization opportunities here.

Here's an outline of the code (a stripped-down sketch of the pipeline follows the list):

  1. Read data from a SQL DB into a data frame using sqlQuery() from the RODBC package (~45s)
  2. str_trim() two of the columns (~2-4s)
  3. split() the data into partitions to prepare for performing a quantitative calculation (separate function) (~30m)
  4. run the calculation function in parallel for each partition of the data using parLapply() (~15-20m)
  5. rbind() the results together into a single data frame (~10-15m)
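
A stripped-down sketch of the pipeline, for reference. The DSN, table, column names (col_a, col_b, group_id), and calc_fun are placeholders, not the real schema or calculation:

```r
library(RODBC)     # sqlQuery()
library(stringr)   # str_trim()
library(parallel)  # makeCluster(), parLapply()

## All names below are placeholders for the actual DSN, table, columns, and calculation.
ch <- odbcConnect("my_dsn")
df <- sqlQuery(ch, "SELECT * FROM my_table", stringsAsFactors = FALSE)   # step 1 (~45s)
odbcClose(ch)

df$col_a <- str_trim(df$col_a)                                           # step 2 (~2-4s)
df$col_b <- str_trim(df$col_b)

parts <- split(df, df$group_id)                                          # step 3 (~30m)

cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "calc_fun")
res_list <- parLapply(cl, parts, calc_fun)                               # step 4 (~15-20m)
stopCluster(cl)

result <- do.call(rbind, res_list)                                       # step 5 (~10-15m)
```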

I've tried using ddply() instead of split(), parLapply(), and rbind(), but it ran for several hours (>3) without completing. I've also modified the SQL SELECT statement to return an artificial group ID (the dense rank of the rows over the unique pairs of two columns) in an effort to improve performance, but that didn't have the desired effect either. I've also tried isplit() with foreach() %dopar%, but it likewise ran for multiple hours without finishing.
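
For what it's worth, the isplit()/foreach() attempt looked roughly like this (same placeholder names as in the sketch above; isplit() hands each group to the loop as a list whose $value element is the sub-data-frame):

```r
library(foreach)
library(doParallel)   # also attaches parallel and iterators

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

## Each iteration gets one group; piece$value is that group's rows.
res <- foreach(piece = isplit(df, df$group_id),
               .combine = rbind,
               .export  = "calc_fun") %dopar% {
  calc_fun(piece$value)
}

stopCluster(cl)
```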

The PDI transformation runs Java code, which is undoubtedly faster than R in general, but it seems like the equivalent R script should take no more than 10 minutes (i.e. 20x slower than PDI/Java) rather than an hour or longer.

Any thoughts on other optimization techniques?

Update: step 3 above, split(), was resolved by using indexes, as suggested in Fast alternative to split in R (http://stackoverflow.com/questions/19038140/fast-alternative-to-split-in-r).
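
In case it helps anyone else, the index-based version splits an integer vector of row numbers instead of the 1.28M-row data frame, and the subsetting happens inside the worker. A rough sketch, reusing the cluster cl and placeholder names from above:

```r
## Splitting an integer vector is cheap; splitting the full data frame was the ~30m step.
idx_list <- split(seq_len(nrow(df)), df$group_id)

## Ship the data frame and the calculation to the workers once,
## then subset by index inside each task.
clusterExport(cl, c("df", "calc_fun"))
res_list <- parLapply(cl, idx_list, function(i) calc_fun(df[i, ]))
```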

Update 2: I tried using mclapply() instead of parLapply(), and it's roughly the same (~25m).
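
The mclapply() version is essentially a drop-in replacement; it forks the current process, so the workers see df and calc_fun without any clusterExport(). A sketch:

```r
library(parallel)

## Forked workers share the parent's memory; note that forking is not
## available on Windows (mc.cores is effectively limited to 1 there).
res_list <- mclapply(idx_list, function(i) calc_fun(df[i, ]),
                     mc.cores = detectCores() - 1)
```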

Update 3: rbindlist() instead of rbind() runs in under 2s, which resolves step 5.
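
That step is now just data.table's rbindlist() over the list of per-group results (it returns a data.table rather than a plain data frame):

```r
library(data.table)

## rbindlist() binds a list of data frames in C -- far faster than do.call(rbind, ...).
result <- rbindlist(res_list)
```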

  • You're only going to get very vague suggestions (e.g. "use data.table") unless you provide a specific piece of code that's running slowly. – joran Nov 04 '13 at 20:42
  • I've previously posted a related question about the performance of `split()` specifically here: http://stackoverflow.com/questions/19038140/fast-alternative-to-split-in-r – argoneus Nov 04 '13 at 20:53
  • And you got pretty good feedback, where it was demonstrated that `split` on 250k categories is not slow. The answer there suggested that perhaps you were failing to convert factors to character (or something else) resulting in _way_ more than 250k unique IDs. So I consider your point (3) addressed. – joran Nov 04 '13 at 20:59
  • Yes, using `split()` for indexes did the trick for step 3. I'll investigate the other pieces to see if there are other opportunities for perf improvements. Thanks for the comment. – argoneus Nov 04 '13 at 21:59

0 Answers