
I'm trying to read a 6+ GB CSV file to do some aggregation. I've tried the following calls:

read.table('csv_file',sep=",", head=T, stringsAsFactors=F)
read.csv("csv_file",as.is=T,header=F,quote="")

However, regardless of the method, I get errors like the following:

Warning message:

In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
EOF within quoted string

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
more columns than column names

I've seen that many people have reported similar errors, but none of the suggestions has worked for me so far.

I'd appreciate it if someone could shed some light on this. Thanks in advance.

SusanthaB
  • Possible duplicate of [read.csv warning 'EOF within quoted string' prevents complete reading of file](http://stackoverflow.com/questions/17414776/read-csv-warning-eof-within-quoted-string-prevents-complete-reading-of-file) – SabDeM Jan 20 '16 at 23:26
  • Break up the file into smaller files and see if you can read each of them to determine where the problem is. When you have a very small file that reproduces the error, post it here and we might be able to work out what's causing the problem. – Mist Jan 20 '16 at 23:44
  • Another possible duplicate (with different answers, and do note that the `comment.char` needs to be considered as well as `quote`): http://stackoverflow.com/questions/17763294/all-lines-not-being-read-while-executing-read-csv-in-r/17763959?s=4|0.3738#17763959 – IRTFM Jan 21 '16 at 00:20

2 Answers


To get some insight, use this construction:

table( count.fields("~/Downloads/test1.txt", sep=",", quote="", comment.char="", skip=0) )

If there are just a couple of odd lines, you can narrow them down with different values of 'skip'. I haven't used this on files quite that big, but I have used it on files half that size. The count.fields result can also identify the row numbers whose field count differs from the expected one. For example, if the table shows that 10 rows have one field fewer than the expected 20, locate them with:

which( 
  count.fields("~/Downloads/test1.txt", sep=",", quote="", comment.char="", skip=0) == 19)
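As a rough, self-contained illustration of the approach above, the sketch below builds a tiny throwaway file (the data and the names `tmp`, `n_fields`, `expected` and `bad_lines` are made up for this example), tabulates the field counts, and then pulls out the lines that deviate from the most common count. Treating the most common count as the "expected" one is an assumption that only holds when most rows are well formed.

```r
# Build a small throwaway CSV (hypothetical data) with one malformed line.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,value",
  "1,alpha,10",
  "2,beta",        # line 3 is one field short
  "3,gamma,30"
), tmp)

# Count fields per line with quoting and comments disabled, so a stray
# quote or '#' character cannot swallow the rest of the file.
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

# Distribution of field counts across lines.
field_table <- table(n_fields)

# Assumption: the most frequent count is the expected number of columns.
expected <- as.integer(names(which.max(field_table)))

# Positions of the lines whose field count deviates.
bad_lines <- which(n_fields != expected)

print(field_table)
print(bad_lines)

# Inspect the raw text of the offending lines (fine for a small file).
print(readLines(tmp)[bad_lines])
```

On a multi-gigabyte file, reading everything with readLines just to inspect a few lines is wasteful, so it is probably better to inspect only the reported positions with a tool that can seek to them. Note also that the positions match file line numbers only when count.fields does not skip blank lines (see its `blank.lines.skip` argument).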
IRTFM

The headers are probably not on line 1, so you need to skip a specific number of lines to reach them.
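A minimal sketch of that idea, assuming a hypothetical file whose first two lines are a preamble and whose header row happens to start with "id," (the file contents and the names `tmp`, `head_lines`, `header_line` and `df` are invented for illustration):

```r
# Hypothetical file: two preamble lines, then the header row on line 3.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Export generated by some tool",
  "Rows: 2",
  "id,name,value",
  "1,alpha,10",
  "2,beta,20"
), tmp)

# Peek at only the first few raw lines to find where the header is.
head_lines <- readLines(tmp, n = 5)

# Assumption: the header row is the first line that starts with "id,".
header_line <- grep("^id,", head_lines)[1]

# Skip everything before the header row, then read as usual.
df <- read.csv(tmp, skip = header_line - 1, stringsAsFactors = FALSE)
str(df)
```

Reading just the first few lines with `readLines(..., n = )` keeps the check cheap even for a very large file; the pattern used to spot the header would need to be adapted to the real column names.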

Box and Cox