
I have a huge table (millions of rows, 2 columns) similar to the one below.

    FIELD1 FIELD2
    M01    ABC
    M02    ABC
    MO3    ABC
    M03    CDE
    M04    FGH
    M05    LMN
    M06    FGH
    ...    ...

FIELD1 may have millions of unique values and FIELD2 may have up to 10,000 unique values. I am using the following statements in R to read the txt file and to tabulate it:

dat <- read.table(file.choose(new = FALSE), sep = "\t")
m <- table(dat)

However, it returns the following error:

Error: cannot allocate vector of size 1.5 Gb
R(390,0xac0442c0) malloc: mmap(size=1599119360) failed (error code=12)
error: can't allocate region
set a breakpoint in malloc_error_break to debug

Any idea how to overcome this limit? Many thanks.

Daniele

2 Answers


table will try to create a matrix of length(unique(FIELD1)) by length(unique(FIELD2)), presumably many times the size of your original data; a more efficient representation is a data frame of counts, along the lines of:

lst <- with(test, lapply(split(as.character(FIELD1), FIELD2), table))
df <- data.frame(FIELD1 = unlist(lapply(lst, names), use.names = FALSE),
                 FIELD2 = rep(names(lst), sapply(lst, length)),
                 Count = unlist(lst, use.names = FALSE),
                 stringsAsFactors = TRUE)  # factors are needed below for as.integer()/levels()
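For scale, some rough arithmetic (taking the figures in the question as assumptions) shows why the dense table cannot fit in memory:

# rough memory estimate for the dense table (figures assumed from the question)
n1 <- 1e6            # unique FIELD1 values
n2 <- 1e4            # unique FIELD2 values
n1 * n2 * 4 / 2^30   # integer cells at 4 bytes each: roughly 37 GiB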

This could be represented as a sparse matrix:

library(Matrix)
m <- with(df, {
    sparseMatrix(as.integer(FIELD1), as.integer(FIELD2), x=Count, 
                 dimnames=list(levels(FIELD1), levels(FIELD2)))
})

leading to

> m
7 x 4 sparse Matrix of class "dgCMatrix"
    ABC CDE FGH LMN
M01   1   .   .   .
M02   1   .   .   .
M03   .   1   .   .
M04   .   .   1   .
M05   .   .   .   1
M06   .   .   1   .
MO3   1   .   .   .

> colSums(m)
[1] 3 1 2 1
Martin Morgan
  • Actually what I would need is a matrix (Field1,Field2) on which I have to make some matricial operations. The only solution I see now is to go vector by vector. – Daniele Dec 18 '11 at 11:49
  • The Matrix package has a sparse matrix representation that might be sufficient. – Martin Morgan Dec 18 '11 at 15:34
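Following up on the comment exchange: the sparse representation can also be built in a single call. A sketch, assuming the original two-column data frame is called dat and its columns are factors; xtabs accepts a sparse argument and returns a dgCMatrix directly:

library(Matrix)
# build the sparse contingency matrix in one step (dat assumed from the question)
m <- xtabs(~ FIELD1 + FIELD2, data = dat, sparse = TRUE)
colSums(m)   # counts per FIELD2 value, as above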

This solution uses a single read.csv.sql statement to read the data into an SQLite database (which it creates automatically), performs the calculation in the database (not in R), and only reads the result into R after the calculation has been performed. Thus, although it does not produce a table, it still shows which FIELD1 values are associated with each FIELD2 value, and does so via a more compact representation.

First create some test data:

# create test file
test <- data.frame(FIELD1 = c("M01", "M02", "MO3", "M03", "M04", "M05", "M06"), 
    FIELD2 = c("ABC", "ABC", "ABC", "CDE", "FGH", "LMN", "FGH"))
write.csv(test, file = "test.csv", row.names = FALSE, quote = FALSE)

Then try this:

library(sqldf)
DF <- read.csv.sql("test.csv", 
    sql = "select FIELD2, group_concat(FIELD1) FIELD1 from file group by FIELD2")

With the test data the result is:

>  DF
  FIELD2      FIELD1
1    ABC M01,M02,MO3
2    CDE         M03
3    FGH     M04,M06
4    LMN         M05
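If per-pair counts are wanted instead of concatenated values, the same approach works with a different aggregate. A sketch along the same lines, reusing the test.csv file and column names from the test data above:

library(sqldf)
# per-pair counts computed inside SQLite; only the small result enters R
counts <- read.csv.sql("test.csv",
    sql = "select FIELD1, FIELD2, count(*) as Count
           from file group by FIELD1, FIELD2")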
G. Grothendieck