I have several CSV files of different sizes, all fairly large. Reading them into R with read.csv takes longer than I have been willing to wait (several hours). With data.table's fread, I managed to read the biggest file (2.6 GB) in less than a minute.

My problem occurs when I try to read a file of half the size. I get the following error message:

Error in fread("C:/Users/Jesper/OneDrive/UdbudsVagten/BBR/CO11700T.csv",:

Expecting 21 cols, but line 2557 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=';' and/or (unescaped) '\n' characters within unbalanced unescaped quotes.

fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

Through research I've found suggestions to add quote = "" to the fread call, but that doesn't help. I've also tried the bigmemory package, but R crashes when I do. I'm on a 64-bit system with 8 GB of RAM.

I know there are quite a few threads on this subject, but I haven't been able to solve the problem with any of the solutions. I would really like to use fread (given my good experience with the bigger file), and it seems like there should be some way to make it work - just can't figure it out.

Morten Nielsen
  • seems to be an issue with what's inside `CO11700T.csv` not so much with size – mtoto Jan 20 '16 at 10:24
  • Hi your best bet is to have a look at line 2557 maybe using bash or similar e.g. `head -2557 C011700.csv | tail -1` and cut it or edit it manually – Stephen Henderson Jan 20 '16 at 10:26
  • @StephenHenderson given the path it looks like Windows so they might have trouble with `head` and `tail`... (yeah, install cygwin) – Spacedman Jan 20 '16 at 10:50
  • @Spacedman good point. I believe you can install coreutils inc. `head` and `tail` on Windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm maybe easier than cygwin? – Stephen Henderson Jan 20 '16 at 11:16
  • I had a similar issue the other day and fixed the file in notepad++ (but it was only a few hundred meg). If you don't want to install cygwin etc, perhaps one of these editors will open your files? http://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files – Liesel Jan 20 '16 at 11:47
  • Thanks. I downloaded GVIM to edit the file and found the problem in line 2557. It turns out the ampersand was encoded as `&amp;`, and the problem is the semicolon, which is my separator. I changed it to `&` and ran the code again. Of course it just pointed out the next line that contains the same error. So right now I'm trying to do a search and replace all with GVIM, which sounds like something that should be easy, but it's just not doing it. At least I found a way to work on the problem now - just need to try a bit more. – Morten Nielsen Jan 20 '16 at 14:16
  • OK, gave up on GVIM because I couldn't get it to find and replace all occurrences of the problem. Moved on to a trial version of SlickEdit, which did the job. Had to search and replace a bunch of different encodings containing the semicolon, then had to locate (via R's error message) a bunch of other faulty lines that had a semicolon missing. Finally I was able to read it in less than a minute using fread. Thanks for all the suggestions - they were all correct in pointing me to solve the problem by editing the file. I'd be happy to approve yours @LesH as the correct answer as it helped me out – Morten Nielsen Jan 20 '16 at 22:37
  • That's kind, but it was a group effort and you figured out the final solution yourself. As much as I'd like the rep, you should answer your own question http://stackoverflow.com/help/self-answer – Liesel Jan 20 '16 at 23:16
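
For readers without `head` and `tail`, the inspection suggested in the comments can probably be done from R itself. This is only a rough sketch: `show_line` and `find_bad_lines` are hypothetical helper names, not part of data.table, and the field count is a plain separator count that ignores quoting. It also reads the whole file into memory, which may be slow for files of this size.

```r
# Hypothetical helper: print one line of a text file (e.g. the line named in fread's error).
show_line <- function(path, line_no) {
  readLines(path, n = line_no, warn = FALSE)[line_no]
}

# Hypothetical helper: return the numbers of all lines whose field count
# differs from the expected number of columns.
# Assumption: fields are counted by separators only, so quoting is ignored.
find_bad_lines <- function(path, expected_cols, sep = ";") {
  lines <- readLines(path, warn = FALSE)
  # A line with k separators has k + 1 fields.
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  which(n_sep + 1L != expected_cols)
}
```

For the file in the question, something like `show_line("CO11700T.csv", 2557)` and `find_bad_lines("CO11700T.csv", 21)` would list the offending lines in one pass, rather than one fread error at a time.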

1 Answer

Solved this by installing SlickEdit and using it to edit the lines that caused the trouble. A few characters like ampersands, quotation marks, and apostrophes were consistently encoded in a form that includes a semicolon, e.g. `&amp;` instead of just `&`. As the semicolon was the separator in the text document, this caused the problem when reading with fread.
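
The same search and replace could probably be scripted in R instead of done in an editor. The sketch below makes some assumptions: the entity list (`&quot;`, `&apos;`, `&#39;`, `&amp;`) is a guess at what the file contains, since only the ampersand case is spelled out above, and `decode_entities` and `clean_file` are hypothetical helper names. It reads the whole file into memory and does not repair lines where a semicolon is missing.

```r
# Hypothetical helper: replace a few HTML entities with the literal character,
# so their trailing ';' no longer looks like a field separator.
# Assumption: this entity list is illustrative; the real file may use others.
decode_entities <- function(lines) {
  # "&amp;" goes last so that "&amp;quot;" ends up as "&quot;", not as a quote.
  entities <- c("&quot;" = "\"", "&apos;" = "'", "&#39;" = "'", "&amp;" = "&")
  for (e in names(entities)) {
    lines <- gsub(e, entities[[e]], lines, fixed = TRUE)
  }
  lines
}

# Hypothetical helper: write a cleaned copy of `input` to `output`.
clean_file <- function(input, output) {
  lines <- readLines(input, warn = FALSE)
  writeLines(decode_entities(lines), output)
  invisible(output)
}
```

After `clean_file("CO11700T.csv", "CO11700T_clean.csv")` (the output name is made up), the cleaned copy can be passed to `data.table::fread(..., sep = ";", quote = "")`. Setting `quote = ""` is a deliberate choice here, because decoding `&quot;` puts literal quotation marks back into the fields.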

Morten Nielsen