
I have a .csv file that is 112 GB in size, but neither vroom nor data.table::fread will open it. Even if I ask to read only 10 rows or just a couple of columns, it complains with a mapping error: cannot allocate memory.

    df <- data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10)
    System errno 22 unmapping file: Invalid argument
    Error in data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10) : 
      Opened 112.3GB (120565605488 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.

read.csv, on the other hand, will read the ten rows happily.

Why won't vroom or fread read it using the usual altrep, even for 10 rows?

HCAI
  • Assuming you are in the working directory, try this code `fread(cmd="head -n 10 FINAL_data_Bus.csv", select=1:2)`. – B. Christian Kamgang Aug 05 '21 at 09:53
  • That works, thank you. What will that show me? – HCAI Aug 05 '21 at 10:33
  • It shows the first 10 rows (including the row containing the column names). Do you need to read all the data? You can try to read it in chunks while filtering only the columns and rows that you need for your work. Note that the CL `split` command is quite handy for splitting. What are the numbers of columns and rows of your dataset? If you do not know, try something like: `ncol(fread(cmd="head -n 10 FINAL_data_Bus.csv"))` and `system("wc -l FINAL_data_Bus.csv", TRUE)`. – B. Christian Kamgang Aug 05 '21 at 11:56
  • @B.ChristianKamgang - These comments would usefully be made into an answer as the matter comes up often and your approach provides both methods of inspection and guidance toward chunking. My thoughts and request. – Chris Aug 05 '21 at 13:54
  • @B.ChristianKamgang So we have to split the large csv file first and read all chunks using some kind of loop? If so, `xsv` from https://github.com/BurntSushi/xsv does a better job than the CL `split` since it keeps headers in each chunk. However, I'm still looking for a way to read large gzipped csv file without having to split it first. – San Aug 07 '21 at 13:10
  • @San, while ensuring headers is certainly convenient, I don't think using "xsv" is strictly necessary: using simply `split -l` (cli tool, not R) the first file will have headers and not the others, but one can easily `fread("split1.csv")` (normal names), then `fread("split2.csv")` and then assign column names from split1 (etc). – r2evans Aug 09 '21 at 16:33
  • @r2evans Certainly, but using `split` involves more work compared to `xsv`. – San Aug 10 '21 at 01:36
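
A minimal sketch of the chunk-by-chunk reading suggested in these comments, assuming a Unix-like shell with head and sed available (on Windows e.g. via Rtools or Gnu On Windows), that the file has no newlines embedded inside quoted fields, and with a purely illustrative file name and chunk size:

    library(data.table)

    # Read rows start_row..(start_row + n_rows - 1); sed streams the file,
    # so nothing gets memory-mapped (line 1 of the file is the header).
    read_chunk <- function(file, start_row, n_rows) {
      rng <- sprintf("%.0f,%.0fp", start_row + 1, start_row + n_rows)
      dt  <- fread(cmd = sprintf("sed -n '%s' %s", rng, file), header = FALSE)
      # take the column names from the top of the file (header + one data row)
      setnames(dt, names(fread(cmd = sprintf("head -n 2 %s", file), header = TRUE)))
      dt
    }

    chunk1 <- read_chunk("FINAL_data_Bus.csv", start_row = 1,       n_rows = 1e6)
    chunk2 <- read_chunk("FINAL_data_Bus.csv", start_row = 1e6 + 1, n_rows = 1e6)

Note that each call rescans the file from the top, so for many chunks splitting the file once with split (as discussed above) may well be faster.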

1 Answer


This matter has been discussed by the main creator of the data.table package at https://github.com/Rdatatable/data.table/issues/3526. See the comment by Matt Dowle himself at https://github.com/Rdatatable/data.table/issues/3526#issuecomment-488364641. From what I understand, the gist of the matter is that to read even 10 lines from a huge csv file with fread, the entire file needs to be memory-mapped. So fread cannot be used on its own if your csv file is too big for your machine. Please correct me if I'm wrong.
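
As a workaround (this is the trick from the comments above), you can have a shell command stream just the first few lines into fread, so that fread parses a pipe instead of memory-mapping the whole file. A minimal sketch, assuming head is available on your system (on Windows e.g. via Rtools or Gnu On Windows):

    library(data.table)

    # fread parses the pipe's output, so the 112 GB file is never memory-mapped
    peek <- fread(cmd = "head -n 10 FINAL_data_Bus.csv", select = 1:2)

    # the same trick gives a quick look at the number of columns
    ncol(fread(cmd = "head -n 10 FINAL_data_Bus.csv"))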

Also, I haven't been able to use vroom with csv files that are bigger than RAM. Any pointers in that direction will be appreciated.

For me, the most convenient way to inspect a huge (gzipped) csv file is the small command-line tool csvtk from https://bioinf.shenwei.me/csvtk/

e.g., check the dimensions with

    csvtk dim BigFile.csv.gz

check out the head, i.e. the top 100 rows, with

    csvtk head -n100 BigFile.csv.gz

and get a better view of the above with

    csvtk head -n100 BigFile.csv.gz | csvtk pretty | less -SN

Here I've used the less command, available on Windows via "Gnu On Windows" at https://github.com/bmatzelle/gow

A word of caution - many people suggest using the command

    wc -l BigFile.csv

to check the number of lines in a big csv file. In most cases this will equal the number of rows. But if the big csv file contains newline characters within a cell (to use a spreadsheet term), wc -l counts lines, not rows, and the two will differ. So it is advisable to use csvtk dim or csvtk nrow instead; other csv command-line tools like xsv and miller will also report the correct number of rows.
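
A tiny self-contained illustration of that difference (the toy file and column names here are made up):

    library(data.table)

    # one quoted field contains a newline, so the file has 4 physical lines
    # (header + 3) but only 2 actual data rows
    tmp <- tempfile(fileext = ".csv")
    writeLines(c('id,comment', '1,"first row"', '2,"spans', 'two lines"'), tmp)

    length(readLines(tmp))   # 4 lines, which is what wc -l counts
    nrow(fread(tmp))         # 2 rows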

Another word of caution - the short command fread(cmd="head -n 10 BigFile.csv") is not advisable for previewing the top few lines if some columns contain significant leading zeros in their data, such as 0301 or 0542, since without a column specification fread will interpret them as integers and drop the leading zeros. For example, in some databases that I have to analyse, a leading zero in a particular column means that the entry is a Revenue Receipt. So it is better to preview a big csv file with a command-line tool like csvtk, miller or xsv together with less -SN, which shows the file "as is" without any potentially wrong interpretation.
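
If you do preview with fread anyway, forcing columns to character keeps the leading zeros. A small sketch (the column name "code" below is only an example):

    library(data.table)

    # read everything as character so values like 0301 keep their leading zero
    fread(cmd = "head -n 10 BigFile.csv", colClasses = "character")

    # or pin down just the affected column(s) by name
    fread(cmd = "head -n 10 BigFile.csv", colClasses = list(character = "code"))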

PS1: Even spreadsheets like MS Excel and LibreOffice Calc lose leading zeroes in csv files by default. LibreOffice Calc actually shows the leading zeroes in its preview window but loses them when you load the file! I'm yet to find a spreadsheet that does not lose leading zeroes in csv files by default.

PS2: I've posted my approach to querying very large csv files at https://stackoverflow.com/a/68693819/8079808

EDIT:

vroom does have difficulty dealing with huge files, since it needs to keep the index in memory as well as any data you read from the file. See the development thread at https://github.com/r-lib/vroom/issues/203

San