I am attempting to read a very large .csv file into R

OrigTotalPumping <- read.table("/Users/Larry/Desktop/OrigTotalPumping/MasterOrigFt3.csv", 
                               header=F, sep=",")

I get an error back

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 19437 did not have 7 elements

So I tried simply skipping that line, but the same error occurs for many other lines in the file, and the file is far too large to open and fix by hand. For example:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 7 elements

and

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 19439 did not have 7 elements

How do I set up the call so that it can read this file in despite these problems?

  • Well, do you know exactly what's causing that problem? Do you know why that line is shorter than the others? – MrFlick Jul 13 '14 at 06:27
  • Why don't you give the `colClasses` parameter a try, to define the number of columns in your csv? – Aashu Jul 13 '14 at 06:33
  • You can print the bad line with `scan("MasterOrigFt3.csv", "character", skip=19436, n=1, sep="\n")` – MrFlick Jul 13 '14 at 06:33
  • @MrFlick no I don't; the file is so large I can't even open it to view. – user3736201 Jul 13 '14 at 06:46
  • @user3736201 that's why Flick was suggesting you `scan` just the bad line. R should have no problem doing that. Similarly, you could assign *all* "bad" lines to `baddies` with `txt <- readLines("/Users/Larry/Desktop/OrigTotalPumping/MasterOrigFt3.csv"); baddies <- txt[which(count.fields(textConnection(txt), ',') != 7)]`. – jbaums Jul 13 '14 at 06:52
  • To read in just the lines with 7 fields, you could do something like: `txt <- readLines("/Users/Larry/Desktop/OrigTotalPumping/MasterOrigFt3.csv"); OrigTotalPumping <- read.csv(text=txt[which(count.fields(textConnection(txt), ',') == 7)], header=FALSE)` (This corrects a typo from my previous, now-deleted comment). – jbaums Jul 13 '14 at 06:55
  • Didn't see that comment before I replied. I did scan just now and the lines have the same number of elements – user3736201 Jul 13 '14 at 06:56
  • On another note, see [this post](http://stackoverflow.com/q/159521/489704) for suggestions of text editors that can open/edit text files whose size is of the order of multiple GB. – jbaums Jul 13 '14 at 07:01
  • I think I will open it up and have a look in vim but I don't think there's anything to edit from the looks of it... it seems (from looking at specific lines) they have 7 elements? – user3736201 Jul 13 '14 at 07:07
  • Also, maybe try setting `quote = ""` in `read.table()`; stray quote characters are what usually cause these types of errors (see the sketch after these comments). – David Arenburg Jul 13 '14 at 08:38
  • @user3736201 Can you edit your question to include the bad line? Is there a `#` or `'` in there? That can be fixed by setting the `quote=` or `comment=` parameters. – MrFlick Jul 13 '14 at 17:41
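
Pulling those suggestions together, a minimal sketch of what that might look like (the path and the 7-column expectation are taken from the question; the `quote`/`comment.char` settings and the `count.fields()` check are just the ideas from the comments above, not a tested fix for this particular file):

# Retry the read with quote and comment handling disabled, since stray
# quote or '#' characters are a common cause of miscounted fields.
f <- "/Users/Larry/Desktop/OrigTotalPumping/MasterOrigFt3.csv"
OrigTotalPumping <- read.table(f, header=FALSE, sep=",",
                               quote="", comment.char="")

# If that still fails, count the fields on every line to locate the bad rows.
nf <- count.fields(f, sep=",", quote="", comment.char="")
table(nf)          # distribution of field counts across the file
which(nf != 7)     # line numbers that do not have 7 fields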

2 Answers

Don't skip the lines. Instead, put the read inside a loop, wrap each iteration in a `tryCatch()` block, and add error handling for the record(s) that cause problems.
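
A rough sketch of that idea, assuming the path and 7-column layout from the question (looping over a multi-GB file line by line will be slow, so treat this as an illustration of the `tryCatch()` pattern rather than a drop-in fix):

# Parse the file line by line, wrapping each parse in tryCatch() so that
# malformed records are collected for later handling instead of aborting the read.
lines <- readLines("/Users/Larry/Desktop/OrigTotalPumping/MasterOrigFt3.csv")

good <- list()        # cleanly parsed 7-field rows
bad  <- integer(0)    # line numbers that failed to parse or had the wrong field count

for (i in seq_along(lines)) {
  row <- tryCatch(
    read.table(text=lines[i], sep=",", header=FALSE, stringsAsFactors=FALSE),
    error = function(e) NULL    # parse error: flag the line instead of stopping
  )
  if (is.null(row) || ncol(row) != 7) {
    bad <- c(bad, i)
  } else {
    good[[length(good) + 1]] <- row
  }
}

OrigTotalPumping <- do.call(rbind, good)   # the rows that parsed cleanly
bad                                        # line numbers to inspect or fix by hand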

ak0053792

Can you pre-process the file with awk like this to create a cleaned-up file where every line has 7 fields?

awk 'NF==7' original.csv > cleaned.csv

or, if your field separator is a comma

awk -F',' 'NF==7' original.csv > cleaned.csv
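
Once you have the cleaned-up file, the `read.table()` call from the question should work on it, along the lines of (this assumes cleaned.csv was written to the same directory as the original):

OrigTotalPumping <- read.table("/Users/Larry/Desktop/OrigTotalPumping/cleaned.csv",
                               header=F, sep=",")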

Or, building on @Spacedman's idea, you could use awk to just show you "unhappy" lines that don't have 7 fields

awk -F',' 'NF!=7 {print "Line " NR $0}' yourfile.csv | more
Mark Setchell
  • Sounds FANTASTIC! The field separators are commas, but I got back an empty file after it ran – user3736201 Jul 13 '14 at 08:04
  • Please get a few lines from your file and edit them into your question. Get them like this `head -3 yourfile.csv > sample.csv` – Mark Setchell Jul 13 '14 at 08:07
  • Better still you can use `awk` to look at the failing lines: `awk 'NR==19437' original.csv` or `awk 'NR>19400' original.csv | more` and figure out whether it has missing commas, quoted commas etc etc. (`NR` in `awk` is "Record Number"). And a half-decent text editor should be able to read in a big CSV file for editing if you have to do it manually. – Spacedman Jul 13 '14 at 09:54
  • @Spacedman Good suggestion, thanks, I have edited (a modification of) it into my answer so all can see without trawling the Comments. – Mark Setchell Jul 13 '14 at 11:07