
I am reading several files together into a list of data frames to be able to apply functions to the combined data, but I am running into memory allocation problems when I have too many data frames ("Error: R cannot allocate memory").

e.g. a variable number of data frames is read; let's say for now 3 data frames:

x = data.frame(A=rnorm(100), B=rnorm(200))
y = data.frame(A=rnorm(30), B=rnorm(300))
z = data.frame(A=rnorm(20), B=rnorm(600))
listDF <- list(x,y,z)

Error: R cannot allocate memory

I was wondering whether someone here knows whether, for example, an array or one single data frame with many columns would be a more efficient way of storing and manipulating the data.

The list of data frames is a very practical way because I can manipulate the many columns in the data based on the name of the data frame, which is convenient when dealing with a variable number of data frames. Anyway, if there are any ideas/any ways you like doing this, please share them :) Thank you!

user971102
  • Check out this question http://stackoverflow.com/questions/11486369/growing-a-data-frame-in-a-memory-efficient-manner – Daniel Aug 29 '16 at 21:14
  • Making the actual code you're executing more memory efficient would require seeing the specific code. Broadly speaking, if you are hitting RAM limits one could (1) get a computer with significantly more RAM (e.g. something 'in the cloud'), or (2) keep the data on disk in files/db and only pull in a smaller chunk at a time for processing. – joran Aug 29 '16 at 21:18
  • Thank you Daniel! I will try the data.table solution; I can't use a matrix because I have both character and numeric columns... Thanks!! – user971102 Aug 29 '16 at 21:19
  • I'm not sure how relevant Daniel's link is - it seems focused on adding data to a single data frame rather than having multiple data frames. It's your data frames that are taking up memory, and it doesn't much matter whether they are data frames in the global environment, data frames in a list, data tables in a list or the global environment, etc. The data itself is what is taking up memory space. – Gregor Thomas Aug 29 '16 at 21:41
  • If there are many zeroes in the data, you could use sparse matrices. Otherwise do what @joran suggests: get more RAM or chunk your analysis. – dww Aug 29 '16 at 22:23

3 Answers


This solution may not be ideal as it isn't free, but Revolution R Enterprise is designed to deal with the problem of big data in R. It uses some of the data manipulation capabilities of SQL within R to do faster computations on big data. There is a learning curve, as it has different functions to deal with the new data type, but if you are dealing with big data, the speed-up is worth it. You just have to decide whether the time to learn it and the cost of the product are more valuable to you than some of the slower and more kludgy workarounds.

Barker

data.table provides very efficient data structures in R; take a look, it may be useful for your case.
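
For illustration, a minimal sketch with made-up data (not the OP's files): the memory benefit of data.table comes from adding or modifying columns by reference with :=, which avoids copying the whole table.

library(data.table)

# Made-up tables standing in for the data read from the OP's files
x <- data.table(A = rnorm(100), B = rnorm(100))
y <- data.table(A = rnorm(300), B = rnorm(300))
listDT <- list(x = x, y = y)

# := adds the column to each table in place; no copy of the data is made
invisible(lapply(listDT, function(d) d[, C := A + B]))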

Luis Leal
  • Data tables will *not* occupy less memory than data frames. In fact they are data frames but with an additional class and associated methods. Where data.table can save memory is during operations to add, delete, or modify columns, which it can do without copying. – dww Aug 29 '16 at 22:29

Your example and your mention of the apply family of functions suggest that the structure of the data frames is identical, i.e., they all have the same columns.

If this is the case, and if the total volume of data (all data frames together) still fits in available RAM, then a solution could be to pack all the data into one large data.table with an extra id column. This can be achieved with the function rbindlist:

library(data.table)
x <- data.table(A = rnorm(100), B = rnorm(200))
y <- data.table(A = rnorm(30), B = rnorm(300))
z <- data.table(A = rnorm(20), B = rnorm(600))
dt <- rbindlist(list(x, y, z), idcol = TRUE)
dt
      .id           A           B
   1:   1 -0.10981198 -0.55483251
   2:   1 -0.09501871 -0.39602767
   3:   1  2.07894635  0.09838722
   4:   1 -2.16227936  0.04620932
   5:   1 -0.85767886 -0.02500463
  ---                            
1096:   3  1.65858606 -1.10010088
1097:   3 -0.52939876 -0.09720765
1098:   3  0.59847826  0.78347801
1099:   3  0.02024844 -0.37545346
1100:   3 -1.44481850 -0.02598364

The rows originating from the individual source data frames can be distinguished by the .id variable. All the memory-efficient data.table operations can be applied to all rows, to selected rows (dt[.id == 1, some_function(A)]), or group-wise (dt[, another_function(B), by = .id]).
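
For example, with mean() and sd() standing in for the placeholder functions above:

# Summary of column A for the rows that came from the first source data frame
dt[.id == 1, mean(A)]

# Group-wise summary of column B, one result row per source data frame
dt[, .(meanB = mean(B), sdB = sd(B)), by = .id]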

Although the data.table operations are memory efficient, RAM might still be a limiting factor. Use the tables() function to monitor memory consumption of all created data.table objects:

tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B    
[2,] x      200    2  1 A,B        
[3,] y      300    2  1 A,B        
[4,] z      600    2  1 A,B        
Total: 4MB

and remove objects from memory which are no longer needed

rm(x, y, z)
tables()
     NAME  NROW NCOL MB COLS    KEY
[1,] dt   1,100    3  1 .id,A,B    
Total: 1MB
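
After rm() the objects are gone and R's garbage collector reclaims the memory automatically; calling gc() explicitly (a base R function, not part of data.table) forces a collection and prints a summary of current memory use:

gc()
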
Uwe
  • The memory use you report here is misleading, and is an artefact of using only small data tables. Try it with `x = y = z = data.table(A = rnorm(1e6), B = rnorm(1e6))` and you will see that the combined table is *larger* not smaller than the sum of x, y, and z. – dww Aug 29 '16 at 22:41
  • @dww This is correct, as the additional `.id` column does need additional memory. But that is not the point here. As you have pointed out in your comment to @Luis: "Where data.table can save memory is during operations to add, delete, or modify columns, which it can do without copying" – Uwe Aug 29 '16 at 22:49
  • It also is not at all clear if OP's real data is actually of the same structure or if OP just made a convenient example. Even if they are combinable, while you are correct that using `data.table` can cut down on memory during processing, as dww says, putting them into one big data.table - which seems to be the crux of your answer - won't do anything to help them fit in memory in the first place. – Gregor Thomas Aug 29 '16 at 23:01
  • @Gregor Thanks for your comments. I've amended my answer to emphasize the prerequisites. – Uwe Aug 30 '16 at 07:01