
I have a 4 GB CSV file to load on my 16 GB machine. fread and read.csv can't load it at once; they return memory errors.

So I decided to read the file in chunks, and it worked (after an hour or so). I now have a list of data.frames that takes 2.5 GB if I trust the Environment tab in RStudio, and 1.2 GB when saved as an RDS.

The issue I have now is concatenating everything back into one big data.frame. From what I understand, rbindlist is the most efficient solution (or is it bind_rows?), but in my case it still uses too much memory.

I think I could solve this by applying rbindlist to the list items n at a time, then recursively, until I get my final result (see the sketch below). This n would have to be calibrated manually though, and the whole process is really ugly (on top of the annoying CSV import).
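Something along these lines, where df_list stands for my list of chunks and n for the batch size (a rough, untested sketch):

library(data.table)

bind_in_batches <- function(df_list, n) {
  # bind the list n items at a time, repeatedly, until a single table is left
  while (length(df_list) > 1) {
    groups  <- ceiling(seq_along(df_list) / n)
    df_list <- lapply(split(df_list, groups), rbindlist)
  }
  df_list[[1]]
}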

Another idea that crossed my mind is to find a way to feed an SQLite database with my loaded data, and then query it from R (I'll only do subset, min and max operations on the data).

Can I do better than this?

My data is made up only of integers and doubles, if that makes a difference.

moodymudskipper
  • Why do you want to construct a data.frame if your data is only numeric (why not a matrix? / sparse matrix) – talat Sep 01 '17 at 12:44
  • Put your data in a database – Hong Ooi Sep 01 '17 at 12:46
  • @docendo-discimus I'll be happy enough if I get a big matrix, though at this point I think it won't cost much to convert it to a `data.frame` – moodymudskipper Sep 01 '17 at 13:01
  • @Hong-Ooi I can't install much software on my system. The data comes from SAS but I can't access it directly from R, hence the export to CSV. I thought about re-exporting it to `SQLite` from my chunks, as I think there's nothing other than R libraries to install; would that be what you advise? – moodymudskipper Sep 01 '17 at 13:03
  • I would go for SQLite. Read the first answer to this question: https://stackoverflow.com/questions/4332976/how-to-import-csv-into-sqlite-using-rsqlite – digEmAll Sep 01 '17 at 13:18
  • Thanks @digEmAll, from your link I found this one: https://stackoverflow.com/questions/4350131/unix-importing-large-csv-into-sqlite, which on my system is faster and takes less space, but it's the same idea. I'll add it as an answer, though it seems CPak's solution may be the best one for me – moodymudskipper Sep 01 '17 at 14:19
  • Yeah, bigmemory is great... unfortunately there was a period when it wasn't available for Windows, so I stopped using it (I have to work with Windows in my company)... but now it seems it is available on MS OS again, good to know – digEmAll Sep 02 '17 at 09:29
  • My approach to running queries on very large (compressed) csv files: https://stackoverflow.com/a/68693819/8079808 – San Aug 09 '21 at 10:19

2 Answers

4

Sounds like bigmemory might have just enough functionality for your problem.

require(bigmemory)

Read files

You can read files in as a big.matrix with

read.big.matrix(filename, sep = ",", header = FALSE, col.names = NULL,
    row.names = NULL, has.row.names = FALSE, ignore.row.names = FALSE,
    type = NA, skip = 0, separated = FALSE, backingfile = NULL,
    backingpath = NULL, descriptorfile = NULL, binarydescriptor = FALSE,
    extraCols = NULL, shared = TRUE)
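For example, a call for a large numeric CSV could look roughly like this (the file names are placeholders, and header/type depend on your file); the backing and descriptor files keep the data on disk rather than in RAM:

# sketch only: adjust paths, header and type to your file
x <- read.big.matrix("mydata.csv", sep = ",", header = TRUE,
                     type = "double",
                     backingfile    = "mydata.bin",
                     descriptorfile = "mydata.desc")

The descriptor file can then be reused in later sessions with attach.big.matrix("mydata.desc"), so the CSV only needs to be parsed once.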

Save memory

Even with a simple example like iris, you can see memory savings

x <- as.big.matrix(iris)
options(bigmemory.allow.dimnames=TRUE)
colnames(x) <- c("A", "B", "C", "D", "E")

object.size(x)
# 664 bytes

object.size(iris)
# 7088 bytes

Subsetting

Subsetting big.matrices is limited, but some functionality is provided by mwhich.

Subset rows where column 1 is <= 5 AND column 2 is <= 4:

x[mwhich(x, 1:2, list(c(5), c(4)), list(c('le'), c('le')), 'AND'),]

#       A   B   C   D E
# 2   4.9 3.0 1.4 0.2 1
# 3   4.7 3.2 1.3 0.2 1
# 4   4.6 3.1 1.5 0.2 1
# 5   5.0 3.6 1.4 0.2 1
# etc

NOTE: the result of a subset operation is a regular matrix. You can convert a regular matrix back to a big.matrix with as.big.matrix().

Min, max, mean, etc

biganalytics provides more functionality for big.matrices

require(biganalytics)

colmin(x, cols = 1:2, na.rm = FALSE)
#   A   B 
# 4.3 2.0

colmax(x, cols = 1:2, na.rm = FALSE)
#   A   B 
# 7.9 4.4 

Output

Finally, you can write a big.matrix back out to a file with

write.big.matrix(...)
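A call might look like this, assuming x is the big.matrix from above and the output path is a placeholder:

# sketch: write the big.matrix back out to a plain csv file
write.big.matrix(x, "output.csv", sep = ",", col.names = TRUE)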
CPak
  • This is great, thanks, I tested it on a smaller table and it seems really fast. I think it's probably the best solution until the data is really too big and can't be loaded, or we need to have more free RAM; then using SQLite becomes a better choice. I'm testing both ways now on my big dataset and will add the solution with `RSQLite` as an answer for completeness. – moodymudskipper Sep 01 '17 at 14:16
  • I think there is something you missed in your "Save memory" section: it seems like ANY `big.matrix` object weighs 664 bytes, try: `x <- as.big.matrix(rbind(iris,iris,iris,iris,iris,iris,iris,iris,iris,iris,iris));object.size(x)`. And strangely, when I save the RDS, I only save 1 kB. So maybe bigmemory only builds a connection to the file? It still took like 2 hours for it to make this 664-byte variable, so I suppose some data must be somewhere, and I'd like to save it, but I have no idea how. – moodymudskipper Sep 01 '17 at 15:36
  • Yes, `big.matrices` are essentially pointers to a file. I didn't realize it would always be the same object.size but in hindsight it makes sense. The underlying data itself is the same... so if you want to save the full matrix as an RDS file, you will need to convert it to a regular matrix, and that kind of defeats the purpose. `bigmemory` lets you operate on the data lazily, so if you only need a subset of your data, it lets you load only the subset (rather than the full matrix) into memory. – CPak Sep 01 '17 at 15:48
  • Then I'll need to test if `readRDS` is faster than `read.big.matrix` and if it is, I'll save as RDS AFTER a round of `read.big.matrix` :). – moodymudskipper Sep 01 '17 at 16:05
0

Following hints in the comments I ended up checking this solution, though I'll probably accept @CPak's answer in the end (I'll edit this post in time with the final info).

For my specific case I use it this way: first create the database and feed it with my table:

library(RSQLite)
library(dplyr)
# define dbpath (ending with ".SQLite" to be clean), my_table_name, csv_path
db <- dbConnect(SQLite(), dbname = dbpath) # creates the database if it doesn't exist, and a connection to it
dbWriteTable(conn=db, name=my_table_name, value=csv_path, row.names=FALSE, header=TRUE) # creates table in DB
dbDisconnect(db) 

Then access it:

db    <- dbConnect(SQLite(), dbname= dbpath) # creates a connection to db
my_table <- tbl(db, my_table_name)

Then my_table behaves pretty much like a data.frame; I think there are some limitations, but for basic operations it works just fine.
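For instance, the subset, min and max operations I mentioned can be written with dplyr verbs and get translated to SQL, so only the result is pulled back into memory (V1 and V2 are placeholder column names here):

my_table %>%
  filter(V1 <= 5, V2 <= 4) %>%                      # the subsetting happens in SQLite
  summarise(min_V1 = min(V1, na.rm = TRUE),
            max_V2 = max(V2, na.rm = TRUE)) %>%
  collect()                                         # bring the (small) result into R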

The database that is created has approximately the same size as the CSV source, so roughly 4 times bigger than an RDS file, but with the big advantage that one doesn't need to load it into memory.

Edit: It may be worth investigating readr::read_csv_chunked and chunked::read_csv_chunkwise.
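I haven't tried it yet, but the idea would be something like this rough sketch with readr, appending each chunk to the SQLite table so the whole file never sits in memory (chunk_size is a guess):

library(readr)
db <- dbConnect(SQLite(), dbname = dbpath)
read_csv_chunked(
  csv_path,
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    dbWriteTable(db, my_table_name, chunk, append = TRUE)  # append each chunk to the table
  }),
  chunk_size = 100000
)
dbDisconnect(db)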

moodymudskipper