
I am working with a very large data set which I am downloading from an Oracle database. The data frame has about 21 million rows and 15 columns. My OS is Windows XP (32-bit) and I have 2 GB of RAM. Short-term I cannot upgrade my RAM or my OS (it is at work, and it will take months before I get a decent PC).

library(RODBC)
Channel1 <- odbcConnect("my_dsn")  # connection handle; DSN name is an example
df <- sqlQuery(Channel1, "SELECT * FROM table1", stringsAsFactors = FALSE)

Here I already get stuck with the usual "Cannot allocate xMb to vector" error. I found some suggestions about using the ff package. Can anybody familiar with the ff package tell me whether it would help in my case? Do you know another way to get around the memory problem? Would a 64-bit solution help? Thanks for your suggestions.

ddg
  • Maybe, if the data schema permits, you could put that table in HBase or Hadoop in CSV format and try R in a map-reduce fashion? See RHadoop. – FUD Nov 01 '12 at 21:58
  • Rent a computer from Amazon for $2/hour; [these fine folks](http://www.bioconductor.org/help/bioconductor-cloud-ami/) have done all the hard work of setting things up for you. – Chase Nov 01 '12 at 22:00

4 Answers


In my experience, processing your data in chunks almost always helps greatly when working with big data. For example, if you calculate a temporal mean, only one time step needs to be in memory at any given time. You already have your data in a database, so obtaining a subset is easy. Alternatively, if you cannot easily process in chunks, you could always try taking a subset of your data. Repeat the analysis a few times to see whether your results are sensitive to which subset you take. The bottom line is that some smart thinking can get you a long way with 2 GB of RAM. If you need more specific advice, you need to ask more specific questions.
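
As a rough illustration of the chunked approach, here is a sketch that computes per-time-step means by querying one time step at a time. The DSN "my_dsn" and the columns time_step and value are placeholders, not names from the question.

library(RODBC)
Channel1 <- odbcConnect("my_dsn")                    # hypothetical DSN

# placeholder columns: time_step (assumed numeric) and value (numeric)
steps <- sqlQuery(Channel1,
                  "SELECT DISTINCT time_step FROM table1 ORDER BY time_step",
                  stringsAsFactors = FALSE)[[1]]

step_means <- sapply(steps, function(s) {
  chunk <- sqlQuery(Channel1,
                    sprintf("SELECT value FROM table1 WHERE time_step = %s", s),
                    stringsAsFactors = FALSE)
  mean(chunk[[1]], na.rm = TRUE)                     # only one time step is in RAM at a time
})
odbcClose(Channel1)

However many rows the full table has, the per-chunk memory footprint stays small.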

Paul Hiemstra

If you are working with package ff and have your data in SQL, you can easily get it into ff using package ETLUtils; see its documentation for an example using ROracle.

In my experience, ff is perfectly suited for the type of dataset you are working with (21 million rows and 15 columns). In fact, your setup is kind of small for ff, unless your columns contain a lot of character data that will be converted to factors (meaning all your factor levels should be able to fit in your RAM). The packages ETLUtils, ff and ffbase allow you to get your data into R using ff and do some basic statistics on it. Depending on what you will do with your data and on your hardware, you might have to consider sampling when you build models. I prefer having my data in R, building a model based on a sample, and scoring using the tools in ff (like chunking) or from package ffbase.

The drawback is that you have to get used to the fact that your data are ffdf objects and that might take some time - especially if you are new to R.
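
A minimal sketch of that ETLUtils route, assuming an ODBC DSN; the DSN name, credentials, query and chunk sizes below are placeholders to adapt:

library(ff)
library(ETLUtils)

# read the query result into an ffdf in chunks; the data end up on disk, not in RAM
dat <- read.odbc.ffdf(
  query = "SELECT * FROM table1",
  odbcConnect.args = list(dsn = "dbs", uid = "", pwd = ""),
  first.rows = 100, next.rows = 50000, VERBOSE = TRUE)

dim(dat)   # all rows are available, but only a small buffer sits in memory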

  • Thanks a lot for your comment, this was very helpful. As you suggested I am using the package *ETLUtils* to download the data into an *ffdf* from the DB; however, I get stuck at a certain point with the following error msg: *"Error in UseMethod("recodeLevels"): no applicable method for 'recodeLevels' applied to an object of class "c('integer', 'numeric')""*. I am using: *read.odbc.ffdf(query = query, odbcConnect.args=list(dsn="dbs",uid="",pwd=""), first.rows = 100, next.rows = 1000, VERBOSE=TRUE)*. Do you have any advice? – ddg Nov 07 '12 at 08:05
  • It is hard to tell without more information about your specific data, but it looks like one of your columns is changing from numeric to character/factor (or vice versa) from one chunk to another. My advice would be to inspect your data to see where that happens and, if needed, use the parameter transFUN to make sure the column always stays the same R data type. – Nov 07 '12 at 13:31
  • It is in fact as you mention: there is one column which sometimes contains numbers and sometimes letters. I modified the parameter *next.rows* to *500'000*. That actually solved the problem (no more error msg) and sped up the whole process (the time to download the data went from 6h to 50m). Still, I would be interested in how to use *transFUN*; could you point to an example? How would you have used it in this case? Thanks in advance. – ddg Nov 07 '12 at 15:49
  • With read.odbc.ffdf, you get the data from your query in chunks, and you can apply a function to each chunk. So if you know your column should be a character/factor, make sure it is, so that the second chunk does not try to add a numeric column to a factor column. As in transFUN=function(x){x$yourcolumn <- as.factor(x$yourcolumn); x} (a slightly fuller sketch follows below). – Nov 07 '12 at 22:09
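
Expanding on that last comment, a sketch of the same call with transFUN; "dbs", yourcolumn and the query are the placeholder names used in this thread, not real ones:

library(ETLUtils)

query <- "SELECT * FROM table1"                      # placeholder query

# force the mixed column to factor in every chunk so its type stays consistent
force_factor <- function(x) {
  x$yourcolumn <- as.factor(as.character(x$yourcolumn))
  x
}

dat <- read.odbc.ffdf(
  query = query,
  odbcConnect.args = list(dsn = "dbs", uid = "", pwd = ""),
  first.rows = 100, next.rows = 500000,
  transFUN = force_factor, VERBOSE = TRUE)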

Sorry I can't help with ff, but on the topic of the RAM: I'm not familiar with the memory usage of R data frames, but for the sake of argument let's say each cell takes 8 bytes (e.g. a double-precision float or a long integer).

21 million * 15 * 8 bytes = about 2.5 GB.
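
In R the back-of-the-envelope calculation is simply (8 bytes per cell is an assumption; character columns behave differently):

rows <- 21e6
cols <- 15
bytes_per_cell <- 8                    # assumed: 8-byte numeric values
rows * cols * bytes_per_cell / 1e9     # about 2.5 GB, before any copying overhead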

Update: see the comments below; this figure is probably an underestimate!

So you could really do with more RAM, and a 64-bit machine would help a lot as 32-bit machines are limited to 4GB (and can't use that fully).

It might be worth trying a subset of the dataset so you know how much you can load with your existing RAM, then extrapolating to estimate how much you actually need. If you can subdivide the data and process it in chunks, that would be great, but lots of problems don't lend themselves to that approach easily.
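
One way to do that extrapolation, sketched with a hypothetical DSN and Oracle's ROWNUM to limit the sample size:

library(RODBC)
Channel1 <- odbcConnect("my_dsn")                    # hypothetical DSN

# load a manageable sample and measure its in-memory footprint
sample_df <- sqlQuery(Channel1,
                      "SELECT * FROM table1 WHERE ROWNUM <= 100000",
                      stringsAsFactors = FALSE)
bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
bytes_per_row * 21e6 / 1e9                           # estimated GB for all 21 million rows
odbcClose(Channel1)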

Also, I have been assuming that you need all the columns! Obviously, if you can filter the data in any way to reduce its size (e.g. by removing irrelevant columns), then that may help greatly!

DNA
  • A rough rule of thumb that is commonly cited is that you need ~3x as much memory as the size of your object to do anything beyond simply loading it into memory in R. (R can do lots of copying.) – joran Nov 01 '12 at 22:09
  • Data frames add approximately 80-100% overhead (due to advanced indexing, I suppose), so you can double your estimated memory. To save memory you can use matrices, which add almost zero overhead. – ffriend Nov 01 '12 at 22:11
  • Thanks for the details - so multiplying my estimates by 2 or 3 strongly reinforces the need for more RAM (and makes 64-bit essential) unless an entirely different (disk-centric) processing model can be used, or the slowdown of virtual memory/memory mapping can be tolerated. – DNA Nov 01 '12 at 22:14

There's another very similar question. In particular, one way to handle your data is to write it to a file and then memory-map that file (see, for example, the mmap package).
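
A minimal sketch with the mmap package; the file name is a placeholder, and in practice the column would first have to be exported from the database to a flat binary file:

library(mmap)

# stand-in for a real export: write one numeric column to a flat binary file
writeBin(rnorm(1e6), "column1.bin")

# memory-map the file: values are paged in from disk on demand rather than loaded into RAM
m <- mmap("column1.bin", mode = real64())
m[1:5]        # reads just those values from disk
munmap(m)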

ffriend
  • While true, the data are already on file on the DB server. I don't see much advantage to this unless the DB server (or the connection) is _really_ slow. – Joshua Ulrich Nov 01 '12 at 22:36
  • @JoshuaUlrich: that's not always the case; you may be connecting to a remote DB. Also, using an RDBMS implies using SQL, and most probably comes down to splitting the data into chunks, which is not always possible. Anyway, I gave the answer that seems to me the simplest to implement, and also a link to a number of other nice answers if this one doesn't fit TC's needs. – ffriend Nov 02 '12 at 06:43