
While working with R I encountered a strange problem. I am processing data in the following manner: reading data from a database into a dataframe, filling missing values, grouping and nesting the data by a combined primary key, creating a time series and forecasting it for every group, then ungrouping and cleaning the data and writing it back into the DB.

Something like this: https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html
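
To give a better idea, here is a condensed sketch of the pipeline, closely following that vignette (the connection, table and column names are just placeholders for my real data):

```r
library(dplyr)
library(tidyr)      # nest() / unnest()
library(purrr)      # map()
library(timetk)     # tk_ts()
library(forecast)   # ets(), forecast()
library(sweep)      # sw_sweep()

# read from the database (placeholder table/columns)
df <- DBI::dbGetQuery(con, "SELECT item, region, month, qty FROM sales")

forecasts <- df %>%
  mutate(qty = coalesce(qty, 0)) %>%                      # fill missing values
  group_by(item, region) %>%                              # combined primary key
  nest() %>%                                              # one nested tibble per group
  mutate(ts      = map(data, tk_ts, select = qty,
                       start = c(2016, 1), frequency = 12),  # time series per group
         fit     = map(ts, ets),                          # model per group
         fc      = map(fit, forecast, h = 12),            # forecast per group
         fc_tidy = map(fc, sw_sweep)) %>%                 # back to a tibble
  select(item, region, fc_tidy) %>%
  unnest(fc_tidy) %>%
  ungroup()

# write the result back into the DB
DBI::dbWriteTable(con, "sales_forecast", forecasts, overwrite = TRUE)
```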

For small data sets this works like a charm, but with larger ones (over about 100,000 entries) I get the "R Session Aborted" screen in RStudio, and the native R GUI simply stops execution and implodes. There is no information in any of the log files I've looked into. I suspect it is some kind of (leaking) memory issue.

As a workaround I'm processing the data in chunks with a for-loop. But no matter how small the chunk size is, I still get the "R Session Aborted" screen, which looks a lot like leaking memory. The whole data set consists of about 5 million rows.
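
The workaround looks roughly like this (again a sketch; `con` is the open DB connection and `forecast_groups()` stands for the pipeline above, wrapped in a function):

```r
library(DBI)
library(dplyr)   # tbl()/collect() also need dbplyr installed

# all distinct primary-key combinations, split into chunks
keys   <- dbGetQuery(con, "SELECT DISTINCT item, region FROM sales")
chunks <- split(keys, ceiling(seq_len(nrow(keys)) / 100))   # 100 keys per chunk

for (chunk in chunks) {
  df <- tbl(con, "sales") %>%
    inner_join(copy_to(con, chunk, overwrite = TRUE),
               by = c("item", "region")) %>%
    collect()                                   # pull only this chunk into memory

  res <- forecast_groups(df)                    # nest / forecast / sweep as above
  dbWriteTable(con, "sales_forecast", res, append = TRUE)

  rm(df, res)
  gc()                                          # try to free memory between chunks
}
```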

I've looked a lot into packages like ff, the `big` family (bigmemory and friends) and matter, basically everything from https://cran.r-project.org/web/views/HighPerformanceComputing.html, but these do not seem to work well with tibbles and the tidyverse way of data processing.

So, how can I improve my script to work with massive amounts of data? And how can I gather clues about why the R session is aborted?

Someone2
  • what machine are you using? try a bigger one ;) – Roman Jul 25 '18 at 11:31
  • I don't think that will help, since the memory (8 GB) is always only half full. – Someone2 Jul 25 '18 at 11:33
  • Sounds like you might get some mileage out of `data.table` -- the [Getting Started](https://github.com/Rdatatable/data.table/wiki/Getting-started) page has a good set of resources. In his answer to the canonical question [data.table vs dplyr](https://stackoverflow.com/a/27840349/7421656), Hadley states the following about dplyr's design: _"Memory and performance: I've lumped these together, because, to me, they're not that important"_. The bottom line is that if you try to work with large data sets, `dplyr`, `tibbles`, and other "tidy" constructs will hamstring you. – Matt Summersgill Jul 25 '18 at 12:12
  • Thanks for the insight, but how can I perform such complex data manipulation without the `tidy` universe? – Someone2 Jul 25 '18 at 12:23

2 Answers


Check out the article at:

datascience.la/dplyr-and-a-very-basic-benchmark

There is a table that shows runtime comparisons for some of the data wrangling tasks you are performing. From the table, it looks as though dplyr with data.table behind it is likely going to do much better than dplyr with a dataframe behind it.

There’s a link to the benchmarking code used to make the table, too.

In short, try adding a key, and try using a data.table instead of a data frame.

To make x your key, assuming your data.table is named dt, use `setkey(dt, x)`.
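
A minimal illustration (the table and column names here are made up):

```r
library(data.table)

# toy data.table standing in for your grouped data
dt <- data.table(x = sample(1e5, 5e6, replace = TRUE),
                 y = rnorm(5e6))

setkey(dt, x)                       # sorts by x and marks it as the key (by reference)

dt[J(42)]                           # fast keyed lookup: all rows with x == 42
dt[, .(total = sum(y)), by = x]     # grouped aggregation without copying the data
```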

Pake
  • I'm quite new to R so I need to have a further look into data.table itself, thank you for the hint. – Someone2 Jul 25 '18 at 13:02
  • The sample code used for benchmarking might be helpful as an intro on how to perform different manipulations using all the different setups. https://github.com/szilard/benchm-dplyr-dt/blob/master/bm-n100m-m100.md – Pake Jul 25 '18 at 13:04

While Pake's answer deals with the described problem, I found a solution to the underlying one: for compatibility reasons I was using R version 3.4.3. Now I'm using the newer version 3.5.1, which works quite fine.

Someone2