Error when reading large CSV file in R

Question

I'm trying to read a 6+ GB CSV file to do some aggregation. I've tried the following calls:

read.table("csv_file", sep = ",", header = TRUE, stringsAsFactors = FALSE)
read.csv("csv_file", as.is = TRUE, header = FALSE, quote = "")

However, regardless of the method, I get errors like the following:

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  more columns than column names

I've seen that many people have reported similar errors, but none of the suggestions has worked for me so far.

I'd appreciate it if someone could shed some light on this. Thanks in advance.

Possible duplicate of [read.csv warning 'EOF within quoted string' prevents complete reading of file](http://stackoverflow.com/questions/17414776/read-csv-warning-eof-within-quoted-string-prevents-complete-reading-of-file) — SabDeM, Jan 20 '16 at 23:26
Break up the file into smaller files and see if you can read each of these to determine where the problem is. When you have a very small file that reproduces the error post it there and we might be able to work out what's causing the problem. — Mist, Jan 20 '16 at 23:44
Another possible duplicate (with different answers, and do note that the `comment.char` needs to be considered as well as `quote`): http://stackoverflow.com/questions/17763294/all-lines-not-being-read-while-executing-read-csv-in-r/17763959?s=4|0.3738#17763959 — IRTFM, Jan 21 '16 at 00:20

IRTFM · Answer 1 · 2016-01-21T00:33:23.813

To get some insight into the file's structure, tabulate the number of fields on each line:

table( count.fields("~/Downloads/test1.txt", sep=",", quote="", comment.char="", skip=0) )

If only a couple of lines look odd, you can narrow them down by trying different values of 'skip'. I haven't used this on files quite that big, but I have used it on files half that size. The count.fields result can also be used to identify the row numbers that have a particular discrepancy in field count. For example, if the table indicates that 10 rows have one fewer than the expected number of columns (say 20), then do this:

which(
  count.fields("~/Downloads/test1.txt", sep=",", quote="", comment.char="", skip=0) == 19)
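To make the workflow concrete, here is a minimal sketch on a tiny temporary file. The file contents are made up for illustration (they are not the asker's data), and the names `tmp`, `n_fields`, `expected`, and `bad_lines` are hypothetical. It tabulates the field counts, treats the most common count as the expected one, and prints the raw text of the lines that deviate.

```r
# Hypothetical 3-column sample; the third line is missing a field.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,value",
             "1,alpha,10",
             "2,beta",
             "3,gamma,30"), tmp)

# Count fields per line with quote and comment handling switched off.
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

# Distribution of field counts; most lines should share a single value.
print(table(n_fields))

# Assumption: the most common field count is the "expected" one.
expected <- as.integer(names(which.max(table(n_fields))))

# Line numbers whose field count differs from the expected count.
bad_lines <- which(n_fields != expected)

# Look at the raw text of the suspicious lines.
print(readLines(tmp)[bad_lines])
```

Note that count.fields skips blank lines by default (`blank.lines.skip = TRUE`), so the positions it reports may not match raw line numbers if the file contains empty lines. On a multi-gigabyte file, reading everything with readLines just to inspect a few rows is likely too expensive, so you would probably want to read only a slice around the reported positions.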

score 0 · Answer 2 · answered Aug 19 '18 at 20:10

The headers are probably not on line 1, so you need to skip a specific number of lines to reach them.
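If that is the cause, something along these lines may help. This is only a sketch: the file contents are invented for illustration, with two preamble lines assumed before the real header, and `tmp` and `df` are hypothetical names.

```r
# Hypothetical file: two preamble lines precede the real header row.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Exported by some tool",
             "Monthly report",
             "id,name,value",
             "1,alpha,10",
             "2,beta,20"), tmp)

# Peek at the first few raw lines to find where the header starts.
print(readLines(tmp, n = 5))

# Skip the two preamble lines so the header is the first line parsed.
df <- read.csv(tmp, skip = 2, header = TRUE, stringsAsFactors = FALSE)
print(df)
```

Looking at the first lines with readLines(..., n = ) is cheap even for a very large file, because only the requested number of lines is read.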

Box and Cox
