
I have a huge table (millions of rows, 2 columns) similar to the one below.

    FIELD1 FIELD2
    M01    ABC
    M02    ABC
    MO3    ABC
    M03    CDE
    M04    FGH
    M05    LMN
    M06    FGH
    ...    ...

FIELD1 may have millions of unique values and FIELD2 may have up to 10,000 unique values. I am using the following statements in R to read the txt file and to tabulate it:

dat <- read.table(file.choose(new = FALSE), sep = "\t")
m <- table(dat)

However, it returns the following error:

Error: cannot allocate vector of size 1.5 Gb
R(390,0xac0442c0) malloc: mmap(size=1599119360) failed (error code=12)
error: can't allocate region
set a breakpoint in malloc_error_break to debug

Any idea how to overcome this limit? Many thanks.

Daniele

2 Answers


table will try to create a matrix of length(unique(FIELD1)) by length(unique(FIELD2)), presumably many times the size of your original data; a more efficient representation is a data frame of counts, along the lines of:

lst <- with(test, lapply(split(as.character(FIELD1), FIELD2), table))
df <- data.frame(FIELD1 = unlist(lapply(lst, names), use.names = FALSE),
                 FIELD2 = rep(names(lst), sapply(lst, length)),
                 Count = unlist(lst, use.names = FALSE),
                 stringsAsFactors = TRUE)  # factors are needed below for as.integer()/levels()
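For scale, some rough arithmetic (taking the figures in the question as assumptions) shows why the dense table cannot fit in memory:

# rough memory estimate for the dense table (figures assumed from the question)
n1 <- 1e6            # unique FIELD1 values
n2 <- 1e4            # unique FIELD2 values
n1 * n2 * 4 / 2^30   # integer cells at 4 bytes each: roughly 37 GiB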

This could be represented as a sparse matrix:

library(Matrix)
m <- with(df, {
    sparseMatrix(as.integer(FIELD1), as.integer(FIELD2), x=Count, 
                 dimnames=list(levels(FIELD1), levels(FIELD2)))
})

leading to

> m
7 x 4 sparse Matrix of class "dgCMatrix"
    ABC CDE FGH LMN
M01   1   .   .   .
M02   1   .   .   .
M03   .   1   .   .
M04   .   .   1   .
M05   .   .   .   1
M06   .   .   1   .
MO3   1   .   .   .

> colSums(m)
[1] 3 1 2 1
Martin Morgan
  • Actually what I would need is a matrix (Field1,Field2) on which I have to make some matricial operations. The only solution I see now is to go vector by vector. – Daniele Dec 18 '11 at 11:49
  • The Matrix package has a sparse matrix representation that might be sufficient. – Martin Morgan Dec 18 '11 at 15:34
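Following up on the comment exchange: the sparse representation can also be built in a single call. A sketch, assuming the original two-column data frame is called dat and its columns are factors; xtabs accepts a sparse argument and returns a dgCMatrix directly:

library(Matrix)
# build the sparse contingency matrix in one step (dat assumed from the question)
m <- xtabs(~ FIELD1 + FIELD2, data = dat, sparse = TRUE)
colSums(m)   # counts per FIELD2 value, as above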

This solution uses a single read.csv.sql statement to read the data into an SQLite database (which it creates automatically), performs the calculation in the database (not in R), and only reads the result into R after the calculation has been performed. Thus, although it does not produce a table, it still shows which FIELD1 values are associated with each FIELD2 value, and does so via a more compact representation.

First create some test data:

# create test file
test <- data.frame(FIELD1 = c("M01", "M02", "MO3", "M03", "M04", "M05", "M06"), 
    FIELD2 = c("ABC", "ABC", "ABC", "CDE", "FGH", "LMN", "FGH"))
write.csv(test, file = "test.csv", row.names = FALSE, quote = FALSE)

Then try this:

library(sqldf)
DF <- read.csv.sql("test.csv", 
    sql = "select FIELD2, group_concat(FIELD1) FIELD1 from file group by FIELD2")

With the test data the result is:

>  DF
  FIELD2      FIELD1
1    ABC M01,M02,MO3
2    CDE         M03
3    FGH     M04,M06
4    LMN         M05
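If per-pair counts are wanted instead of concatenated values, the same approach works with a different aggregate. A sketch along the same lines, reusing the test.csv file and column names from the test data above:

library(sqldf)
# per-pair counts computed inside SQLite; only the small result enters R
counts <- read.csv.sql("test.csv",
    sql = "select FIELD1, FIELD2, count(*) as Count
           from file group by FIELD1, FIELD2")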
G. Grothendieck