I have a big dataset that consists of:

  • 5M rows
  • 11 columns, two of which are categorical with 8k+ levels each
  • the factor variables use contr.sum (deviation/effects coding) instead of the default treatment coding, as sketched below
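For reference, a minimal sketch of applying sum-to-zero coding (`df` and `f` are placeholder names, not from the original setup):

```r
## Deviation (sum-to-zero) coding for one factor column
df$f <- factor(df$f)
contrasts(df$f) <- contr.sum(nlevels(df$f))

## Or switch the global default for all unordered factors
options(contrasts = c("contr.sum", "contr.poly"))
```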

Currently, biglm, ff, and other big-data packages/infrastructure are not an option due to project limitations, so I am stuck with 129 GB of RAM and can only use lm() to get the coefficient estimates and p-values.

Even though I load only the table needed for the regression and call gc() before running lm(), each run fails when R tries to allocate the memory needed to build the model (design) matrix before the regression even starts, hence the familiar "Cannot allocate vector of size XX" error. Even after reducing the factors to 2k levels each and the rows to 1.5M (through manual subsetting and optimal-allocation sampling), the regression runs for 1.5 days (still without parallelization) and then fails to write the lm object because no memory remains for the output.
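For a back-of-the-envelope sense of why the allocation fails: the model matrix lm() materialises is a dense double-precision matrix with one column per coefficient, so the two 8k-level factors alone contribute roughly 16,000 columns. The sketch below assumes one of the 11 columns is the response and the rest are numeric predictors plus the two factors; the exact split does not change the order of magnitude:

```r
## Rough size of the dense model matrix lm() builds,
## before any further copies (QR decomposition, residuals, etc.)
n_rows <- 5e6
n_cols <- 1 + 8 + 2 * (8000 - 1)   # intercept + numeric predictors + factor contrasts
bytes  <- n_rows * n_cols * 8      # doubles are 8 bytes each
bytes / 2^30                       # ~596 GiB, well beyond 129 GB of RAM
```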

I'd like to ask how R stores each variable type in RAM: numeric, integer, factor, character, etc.
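As a starting point, per-type storage can be checked directly with object.size() on small throwaway vectors; the sketch below is only illustrative:

```r
## Approximate per-element storage for common atomic types (64-bit R)
n <- 1e6
object.size(numeric(n))   # doubles: 8 bytes per element, plus a small header
object.size(integer(n))   # integers: 4 bytes per element
object.size(factor(sample(letters, n, replace = TRUE)))  # integer codes + a levels attribute
object.size(sample(letters, n, replace = TRUE))          # character: 8-byte pointers into R's shared string cache
```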

For example, for a data.frame of arbitrary dimensions, when I load it in R:

  • the object size is around 50 MB
  • when I save it as RDS, it compresses to 5 MB
  • but when I save it as RData, it grows to almost 300 MB (a quick way to measure this is sketched below).
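One way to compare these numbers yourself is to measure the in-memory and on-disk sizes directly; `df` below is a placeholder for the data frame in question. Both saveRDS() and save() gzip-compress by default, so a .RData file that is much larger than the RDS is often a sign that extra objects (or a whole environment) were saved alongside the data frame, or that compression was turned off:

```r
## In-memory size versus on-disk sizes for the same object
format(object.size(df), units = "MB")

saveRDS(df, "df.rds")                # gzip-compressed by default
save(df, file = "df.RData")          # also compressed by default
file.size("df.rds")   / 2^20         # size in MB
file.size("df.RData") / 2^20
```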

I'd like to understand how I can estimate how much memory lm() will need to run the regression and store the resulting lm object, and also how R stores it in RAM and why it creates so much overhead.
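On the output side, part of the overhead is that the returned lm object keeps the model frame, fitted values, residuals, effects, and the QR decomposition of the design matrix, so the fit can be several times larger than the input data. If only coefficients and p-values are needed, one workaround using standard lm() arguments is to drop the stored model frame (the variable names below are placeholders):

```r
## model = FALSE drops the stored model frame; x and y already default to FALSE.
## summary() still works because the QR decomposition is retained.
fit <- lm(y ~ x1 + x2 + f1 + f2, data = df, model = FALSE)
summary(fit)$coefficients   # estimates, std. errors, t values, p-values
```

Note that the retained qr component still contains an n-by-p matrix roughly the size of the dense design matrix, so this trims the saved object but does not reduce the peak memory needed during fitting.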

FoxyReign
  • You could try to increase the memory limit: `memory.limit(size=50000)` for example would set the absolute memory limit that R can use to 50000 MB. The maximum limit depends on whether you run 32- or 64-bit R (see `help(memory.limit)`). The function `felm()` in package `lfe` might be good if you have so many levels in your factor variables, but if you can only use `lm()`.... – Alias Apr 24 '17 at 07:56
  • I would love to meet the person who will do the interpretation of those two 8000 level variables. – Roman Luštrik Apr 24 '17 at 07:56
  • Could you clarify why `lm` usage is a must? `lm` is rather suboptimal in terms of RAM usage. – Gregory Demin Apr 24 '17 at 13:58
  • Part of the requirement is to run the algorithm out of the box and automated, hence the 8k levels. Also, lm is needed because that's how the lead statistician designed the solution. I probably could use biglm and manipulate the estimates to mimic deviation coding rather than the dummy coding. – FoxyReign Apr 25 '17 at 00:05

0 Answers