
I am trying to use the code below to import a 4 GB database (about 9,000,000 observations and 100 variables) into R on a Windows 10 machine with 8 GB of RAM:

library(feather)
memory.limit(size=99999)

rais_transp = read_feather('rais_transp.feather')

but every time I try to run it I get the following error message:

"r encountered a fatal error: the session was terminated"

I have also tried reading only one column, but I still get the same message and the session restarts:

rais_transp = read_feather('rais_transp.feather', columns=c('black'))

I used to be able to handle this database on my computer, but now I can't work with it anymore.

Can somebody help me?

Thanks

  • 32-bit or 64-bit R? – CALUM Polwart Oct 26 '21 at 00:15
  • I'm using 64-bit –  Oct 26 '21 at 00:21
  • (Nothing to do with dplyr.) R needs to have a contiguous block of memory for any new object. Since numeric values take up about 10 bytes, you would need 9,000,000 times 100 times 10 bytes (roughly 9 GB) to hold just the dataframe that was read. Not plausible on an 8 GB machine. To do anything useful you probably need to have 3 times the amount of memory taken up by your largest object. When I was actively working, I needed 32 GB to work with a 6 million by 30 variable database. It was the reason I got a Mac rather than a Windows machine in 2008. – IRTFM Oct 26 '21 at 00:42
  • This is the thread with the most votes: https://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session It's old but I think it still applies. – IRTFM Oct 26 '21 at 01:01
  • As far as I know the tidyverse does not really address the issue. On the other hand, there are functions within the bigmemory package: https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb/5174383#5174383 and the RHadoop packages: https://stackoverflow.com/questions/29646482/how-to-install-rhadoop-packages-rmr-rhdfs-rhbase?r=SearchResults&s=3|0.0000 I'm not sure which of these is the right one to use for closing as a duplicate. I'm not closing this because there's always hope that a new package will arrive to fix this perennial problem. – IRTFM Oct 26 '21 at 01:02
  • If your dataset is truly too big to fit into memory at this point, you're probably going to need to partition it on disk and use the arrow package (or alternatively something like the disk.frame package, but I don't think this has support for feather files, so it'd need to be converted to a csv first I think) to work with it. This vignette for the arrow package outlines how to read a file that is too big to fit into memory, and then write it back out as a partitioned dataset to work with further (see section: More Dataset Options): https://arrow.apache.org/docs/r/articles/dataset.html – danh Oct 26 '21 at 02:42
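
A minimal sketch of the arrow-based approach from the last comment might look like the following. It assumes rais_transp.feather is a Feather V2 / Arrow IPC file (a V1 file written by the old feather package may need to be rewritten first), and the year column used for partitioning is made up for illustration:

# Open the file lazily with arrow -- nothing is loaded into RAM at this point
library(arrow)
library(dplyr)

ds = open_dataset('rais_transp.feather', format = 'feather')

# Pull only the columns (and rows) you actually need before collecting
black_only = ds %>%
  select(black) %>%
  collect()

# Optionally write the data back out as a partitioned dataset for later work
# ('year' is a hypothetical partitioning column -- replace it with one of yours)
write_dataset(ds, 'rais_transp_parts', format = 'feather', partitioning = 'year')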

1 Answer


I had a similar problem with a .csv file and followed these steps:

As you are using Windows, you should install Cygwin to preprocess the file. Once it is installed, you can split your file into smaller chunks by running the following in the Cygwin shell:

split -b100m rais_transp.csv

You will have to convert it to a CSV first, as danh pointed out. The option -b100m means that each new chunk will have a size of 100 MB. Since a feather file is smaller than the equivalent CSV, you may need to make even smaller chunks; you can get 1 MB chunks, for example, with -b 1024k.
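
As a rough sketch, assuming the file is instead split by line count, e.g. split -l 1000000 rais_transp.csv (so every row stays intact, since -b cuts on raw byte boundaries), the chunks get the default names xaa, xab, and so on, and only the first chunk carries the header row. They could then be read back one at a time in R:

files = sort(list.files(pattern = '^x[a-z][a-z]$'))    # default chunk names from split
header = names(read.csv(files[1], nrows = 1))          # column names from the first chunk

black_parts = lapply(seq_along(files), function(i) {
  chunk = if (i == 1) {
    read.csv(files[i])                                 # first chunk has the header line
  } else {
    read.csv(files[i], header = FALSE, col.names = header)
  }
  chunk['black']                                       # keep only the column of interest
})
rais_black = do.call(rbind, black_parts)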

You can find related useful information in sections 5.3.2 and 6.6 of the book Efficient R Programming.

Here is the link to check these sections: https://csgillespie.github.io/efficientR/data-carpentry.html#working-with-databases
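
The linked section covers working with databases, i.e. keeping the data on disk rather than in memory. A rough sketch of that idea, assuming the DBI, RSQLite and dbplyr packages are installed and reusing the files and header objects from the snippet above (the file name rais_transp.sqlite and the table name are made up for illustration):

library(DBI)
library(dplyr)

con = dbConnect(RSQLite::SQLite(), 'rais_transp.sqlite')

# Append each chunk to a single on-disk table
for (i in seq_along(files)) {
  chunk = if (i == 1) {
    read.csv(files[i])
  } else {
    read.csv(files[i], header = FALSE, col.names = header)
  }
  dbWriteTable(con, 'rais_transp', chunk, append = TRUE)
}

# Query only what you need; the full table never has to fit in RAM
rais_black = tbl(con, 'rais_transp') %>%
  select(black) %>%
  collect()

dbDisconnect(con)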

M_1