
I have a 20 GB CSV file that I want to convert to an RDS file in R. However, the file is too large to be processed in one go: on a machine with 64 GB of RAM, R reports that it needs to allocate 80.9 GB, which exceeds the available memory. I am therefore wondering whether and how I can read the CSV in chunks, turn each chunk into a separate RDS file, and afterward merge them together. Would that yield the same outcome as directly turning the one CSV file into an RDS file?

I am very new to R and could unfortunately not find any answers to my question.

Below is the current code I'm using.

library(Matrix)
library(data.table)

b <- fread('dtm.csv')
b_matx<- as.matrix(b)
dtm_b <- Matrix(b_matx, sparse = TRUE)

saveRDS(dtm_b, "dtm.rds")
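(For what it's worth, here is a minimal sketch of the row-chunked reading I have in mind; `chunk_size` and the output file names are arbitrary illustrations, and the `tryCatch` guards against `fread` erroring once `skip` runs past the end of the file:)

```r
library(data.table)

# Sketch: read dtm.csv in row chunks and save each chunk as its own RDS file.
chunk_size <- 10000L
header <- names(fread("dtm.csv", nrows = 0))  # grab column names only

i <- 0L
repeat {
  # skip the header line plus the rows already consumed by earlier chunks
  chunk <- tryCatch(
    fread("dtm.csv", skip = 1L + i * chunk_size, nrows = chunk_size,
          header = FALSE, col.names = header),
    error = function(e) NULL   # fread errors when skip passes end of file
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break
  saveRDS(chunk, sprintf("dtm_chunk_%03d.rds", i))
  i <- i + 1L
}
# Later, the chunks could be read back and combined with rbindlist().
```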
Rani
    I'm afraid this won't work either. The computer will still need to allocate more memory. However, take a look at the [bigmemory](https://cran.r-project.org/package=bigmemory) package. Also, the question is: do you really need all the data in the CSV? Maybe you can read line by line (or in chunks, as suggested), and then keep only the part that you need. – January Nov 11 '22 at 14:46
  • Thanks for the suggestion, I'll take a look at it. And yes, I do need all the data. – Rani Nov 11 '22 at 14:52
  • Then maybe you want to make some calculations with the data? If yes, then potentially you could calculate the interesting statistics on the fly or in chunks. All summary stats can be calculated like this. How about you tell us something about the data? – January Nov 11 '22 at 15:09
  • If it's so big, your best bet is an actual relational database engine. There are plenty of free options that are pretty easy to use, e.g. SQLite, MariaDB/MySQL, PostgreSQL. The R package `DBI` allows you to access these databases from R directly and you can even use dplyr syntax with the package `dbplyr` – Ottie Nov 11 '22 at 15:13
  • @January The data is a document-term matrix (120,000 trigrams over 90,000 documents) and I need to convert it to RDS to run an algorithm from [this paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3630898) in R on the entire data to obtain a filtered matrix. Therefore I, unfortunately, can't omit parts of the data. – Rani Nov 11 '22 at 15:18
  • @G.Grothendieck That's what I did but at some point, this error pops up. – Rani Nov 11 '22 at 15:38
  • @G.Grothendieck Good point, that's what I just tried and that's the error I get `pyarrow.lib.ArrowMemoryError: malloc of size 730560 failed`. – Rani Nov 11 '22 at 16:16
  • Check out https://stackoverflow.com/a/73187941/6851825 and https://stackoverflow.com/questions/60928866/read-a-20gb-file-in-chunks-without-exceeding-my-ram-r – Jon Spring Nov 11 '22 at 17:05
  • @G.Grothendieck Yes, I did, similar error – Rani Nov 11 '22 at 17:37

1 Answer


See if this works.

It reads one column at a time using fread. By default fread returns a data.table; data.tables use external pointers, which can be a problem when serializing, so we pass the data.table = FALSE argument to get a plain data frame instead. After reading a column in, it immediately writes it back out as an RDS file. After all columns have been written out as RDS files, it reads them back in and writes the final RDS file combining them. We use the 6-row input in the Note at the end as an example.

If fread with select= still takes up too much memory, use the xsv utility (a command-line tool, not an R program) to ensure that only the column of interest is read in. Prebuilt xsv binaries are available for various platforms; then use the commented-out line instead of the line following it. (Alternatively, use cut, sed or awk for the same purpose.)

You can also try interspersing the code with calls to gc() to trigger garbage collection.

Also try replacing as.data.frame in the last line with setDT, which converts the list to a data.table by reference and avoids an extra copy.

library(data.table)

File <- "BOD.csv"

# fread variant that returns a plain data frame instead of a data.table
freadDF <- function(..., data.table = FALSE) fread(..., data.table = data.table)

# read zero rows just to get the column names
L <- as.list(freadDF(File, nrows = 0))
nms <- names(L)

# write each column out to its own RDS file, one column at a time;
# the commented-out line uses xsv to extract the column outside of R
fmt <- "xsv select %s %s"
# for(nm in nms) saveRDS(freadDF(cmd = sprintf(fmt, nm, File))[[1]], paste0(nm, ".rds"))
for(nm in nms) saveRDS(freadDF(File, select = nm)[[1]], paste0(nm, ".rds"))

# read the per-column RDS files back in and combine them into the final RDS file
for(nm in nms) L[[nm]] <- readRDS(paste0(nm, ".rds"))
saveRDS(as.data.frame(L), sub("\\.csv$", ".rds", File))
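If the as.data.frame copy in the last line is itself too large, the setDT alternative mentioned above can be sketched like this (it assumes L and File are as defined above; setDT converts the list in place, so no second copy of the data is made):

```r
library(data.table)

# alternative last step: convert the list of columns to a data.table
# by reference instead of copying it into a new data frame
setDT(L)
saveRDS(L, sub("\\.csv$", ".rds", File))
```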

Note

write.csv(BOD, "BOD.csv", quote = FALSE, row.names = FALSE)
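To address the question of whether this yields the same outcome as a direct conversion, the round trip can be checked on the small example (a sketch assuming the code above has been run in the same directory):

```r
library(data.table)

# compare a direct read of the CSV with the column-wise RDS round trip
direct  <- fread("BOD.csv", data.table = FALSE)
chunked <- readRDS("BOD.rds")
all.equal(direct, chunked)  # TRUE if the round trip preserved the data
```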
G. Grothendieck