1) The R Import / Export manual should be the first port of call for questions about importing data - there are many options, and what works best for you can be very specific to your data.
http://cran.r-project.org/doc/manuals/R-data.html
read.table in particular gets greatly improved performance if you use the options it provides, especially colClasses, comment.char, and nrows - otherwise this information has to be inferred from the data itself, which can be costly.
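As a minimal sketch (the file name "data.csv" and its three column types here are just placeholders for whatever your data actually looks like):

    # Supplying colClasses, nrows, and comment.char up front lets read.table
    # skip the pass where it would otherwise guess these from the file.
    dat <- read.table("data.csv",
                      header = TRUE,
                      sep = ",",
                      colClasses = c("integer", "numeric", "character"),
                      nrows = 1000000,      # even a mild over-estimate helps memory use
                      comment.char = "")    # "" turns off comment scanning entirely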
2) There is a specific limit on the length (total number of elements) of any vector, matrix, array, column in a data.frame, or list. This is due to a 32-bit index used under the hood, and it applies to both 32-bit and 64-bit R. The number is 2^31 - 1, about 2.1 billion. That is also the maximum number of rows for a data.frame, but the limit is so large that you are far more likely to run out of memory on a single vector of that size before you get anywhere near collecting several of them.
See help("Memory-limits") and help(Memory) for details.
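To put rough numbers on that (just the 2^31 - 1 limit multiplied by 8 bytes per double):

    .Machine$integer.max     # 2147483647, i.e. 2^31 - 1, the per-object length limit
    (2^31 - 1) * 8           # ~1.7e10 bytes, around 17 GB, for one full-length numeric vector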
A single vector of that length would take many gigabytes of memory (depending on the type and storage mode of the vector - around 17 GB for numeric), so it is unlikely to be the binding limit unless you are really pushing things. If you do need to push past the available system memory (64-bit is mandatory here), then standard database techniques as discussed in the import/export manual, or memory-mapped file options (like the ff package), are worth considering. The CRAN Task View on High Performance Computing is a good resource for this end of things.
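As a rough sketch of the memory-mapped route, assuming the ff package is installed (the length here is an arbitrary example, not a recommendation):

    library(ff)
    # Create a disk-backed double vector; the data live in a file on disk and
    # only small chunks are pulled into RAM when you index into it.
    x <- ff(vmode = "double", length = 1e8)   # roughly 800 MB on disk, little RAM used
    x[1:5] <- c(2.5, 3.5, 5, 7, 11)           # reads and writes work via [ much like a normal vector
    x[1:5]

The database route is similar in spirit: keep the full data outside R and pull in only the subset you need for a given step.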
Finally, if you have stacks of RAM (16 GB or more) and need 64-bit indexing, it may arrive in a future release of R. http://www.mail-archive.com/r-help@r-project.org/msg92035.html
Also, Ross Ihaka discusses some of the historical decisions and future directions for an R-like language in papers and talks here:
http://www.stat.auckland.ac.nz/~ihaka/?Papers_and_Talks