This matter has been discussed by the main creator of the `data.table` package at https://github.com/Rdatatable/data.table/issues/3526. See the comment by Matt Dowle himself at https://github.com/Rdatatable/data.table/issues/3526#issuecomment-488364641. From what I understand, the gist of the matter is that to read even 10 lines from a huge csv file with `fread`, the entire file needs to be memory-mapped. So `fread` cannot be used on its own if your csv file is too big for your machine. Please correct me if I'm wrong.
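As an illustration, here is a minimal sketch of the difference (`BigFile.csv` is a placeholder name): asking `fread` for a handful of rows still maps the whole file, while piping through an external command feeds `fread` only the lines you ask for.

    library(data.table)

    # Per the issue linked above, fread memory-maps the entire file
    # even when only a few rows are requested:
    dt <- fread("BigFile.csv", nrows = 10)

    # Piping through an external command hands fread only the selected
    # lines (header + 9 data rows here), so the rest of the file is
    # never mapped:
    dt <- fread(cmd = "head -n 10 BigFile.csv")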
Also, I haven't been able to use `vroom` with csv files bigger than RAM. Any pointers in this direction will be appreciated.
For me, the most convenient way to check out a huge (gzipped) csv file is by using the small command-line tool `csvtk` from https://bioinf.shenwei.me/csvtk/

e.g., check dimensions with

    csvtk dim BigFile.csv.gz

check out the head with the top 100 rows:

    csvtk head -n100 BigFile.csv.gz

and get a better view of the above with

    csvtk head -n100 BigFile.csv.gz | csvtk pretty | less -SN

Here I've used the `less` command available with "Gnu On Windows" at https://github.com/bmatzelle/gow
A word of caution: many people suggest using the command

    wc -l BigFile.csv

to check the number of lines in a big csv file. In most cases, this will be equal to the number of rows. But if the big csv file contains newline characters within a cell, to use a spreadsheet term, the number of lines will differ from the number of rows, and the command above will not give you the number of rows. In such cases it is advisable to use `csvtk dim` or `csvtk nrow`. Other csv command-line tools like `xsv` and `miller` will also show the correct result.
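To see why the two counts differ, here is a small self-contained R sketch (the file name `tiny.csv` is just for illustration): a quoted cell containing a newline adds a physical line but not a row.

    library(data.table)

    # One cell contains an embedded newline; fwrite quotes it properly.
    dt <- data.table(id = 1:2, note = c("plain", "line1\nline2"))
    fwrite(dt, "tiny.csv")

    length(readLines("tiny.csv"))  # 4 physical lines, same count as wc -l
    nrow(fread("tiny.csv"))        # 2 rows: the quoted newline is parsed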
Another word of caution: the short command `fread(cmd="head -n 10 BigFile.csv")` is not advisable for previewing the top few lines if some columns contain significant leading zeros in the data, such as 0301 or 0542, since without a column specification `fread` will interpret them as integers and drop the leading zeros. For example, in some databases that I have to analyse, the first digit zero in a particular column means that it is a Revenue Receipt. So it is better to use a command-line tool like `csvtk`, `miller`, or `xsv` with `less -SN` for previewing a big csv file, since they show the file "as is" without any potentially wrong interpretation.
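That said, if you want to stay inside R for the preview, one way to keep the leading zeros (a sketch, using the same placeholder file name) is to force every column to be read as character, so `fread` does no type guessing:

    library(data.table)

    # Reading everything as character preserves leading zeros,
    # because no column is coerced to integer:
    preview <- fread(cmd = "head -n 10 BigFile.csv", colClasses = "character")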
PS1: Even spreadsheets like MS Excel and LibreOffice Calc lose leading zeros in csv files by default. LibreOffice Calc actually shows the leading zeros in the preview window but loses them when you load the file! I have yet to find a spreadsheet that does not lose leading zeros in csv files by default.
PS2: I've posted my approach to querying very large csv files at https://stackoverflow.com/a/68693819/8079808
EDIT:
`vroom` does have difficulty dealing with huge files, since it needs to store the index in memory as well as any data you read from the file. See the development thread at https://github.com/r-lib/vroom/issues/203
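For what it's worth, one way to lighten `vroom`'s footprint (a sketch; the file and column names are placeholders) is to restrict which columns get materialized, though the index over the whole file still has to fit in memory:

    library(vroom)

    # Only the selected columns are materialized on access; vroom still
    # builds an in-memory index over the whole file, which is the
    # bottleneck discussed in the issue above.
    df <- vroom("BigFile.csv", col_select = c(id, amount))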