
I am trying to read in 50 CSV files from a zip file but keep getting

CParserError: Error tokenizing data. C error: EOF inside string starting at line 166

I know there is an error with reading a particular string within the data and can fix it manually, but I don't want to have to extract all the CSV files manually to fix each one.

import zipfile

import pandas as pd

container = {}
# Raw string so the backslashes in the Windows path aren't treated as escapes
with zipfile.ZipFile(r'C:\Users\Austen\Anaconda\cs109_final\CA34.zip') as zf:
    for name in zf.namelist():
        container[name] = pd.read_csv(zf.open(name))

The problem I found is that there is a single `;` near the end of each CSV file. How would I ignore that?

With reference from:

https://github.com/pydata/pandas/issues/5500

I tried to add

    container[name] = pd.read_csv(zf.open(name), skipfooter=4)

but I get 'unexpected end of data'.

Austen Novis
  • I seem to recall someone using read_table to get around this. Worth a shot imo – Bob Haffner Nov 26 '14 at 21:24
  • I get the same error when I use pd.read_table – Austen Novis Nov 26 '14 at 21:25
  • If the `;` character occurs in a predictable or detectable location, you could first manually read in each archive member file, clean it up, and then pass the sanitized version on to `pd.read_csv()`. If the uncompressed CSV member files aren't too big, you could do all this processing in memory very quickly. – martineau Nov 26 '14 at 21:32
  • It is always in the third-to-last line, but the size of each file varies. Is there a way to search the file, delete that line, and then read the updated file into a dataframe? I don't want to use the last 3 lines in the dataframe anyway. – Austen Novis Nov 26 '14 at 21:35
  • Are there possibly other `;` characters in the files you'd want to preserve? Also it's unclear what a `;` has to do with an EOF character. – martineau Nov 26 '14 at 21:40
  • It is just that one towards the end of the file. How would I read the files in and delete the last 3 lines and then save it as a dataframe? – Austen Novis Nov 26 '14 at 21:41
  • To ignore the last 3 lines of a file, you'd first have to determine how many lines were in the file, and then read it a second time from the beginning and stop when you get to the third-from-the-end line. – martineau Nov 26 '14 at 21:45
  • @AustenNovis I just tried to recreate this and couldn't. I'm using 0.15.1. What version are you using? – Bob Haffner Nov 26 '14 at 21:47
  • Using '0.12.0'. Dataframes are from the pandas library. Is the best way to get length by summing every row? – Austen Novis Nov 26 '14 at 21:48
  • @AustenNovis Ok, upgrading might help. Also, I just read some of your other comments regarding deleting the last 3 lines. You can use the skipfooter argument in read_csv() skipfooter : int, default 0 Number of lines at bottom of file to skip (Unsupported with engine=’c’) – Bob Haffner Nov 26 '14 at 22:24
  • with skipfooter I get the error 'unexpected end of data' – Austen Novis Nov 26 '14 at 22:29
  • yikes.. I see you went through the trouble of finding the row count. Instead of skiprows maybe try nrows. nrows : int, default None Number of rows of file to read. Useful for reading pieces of large files – Bob Haffner Nov 26 '14 at 22:40
  • Sorry, I'm not familiar with the pandas library. However, I have an idea about how to clean up the files without having to read through each of them twice. The result could be written to a temp file, which could then be passed on to `pd.read_csv()`. LMK if you think something like that would be an acceptable solution and I'll give it a shot and post an answer. – martineau Nov 26 '14 at 23:13
  • Some parting thoughts on this 1) try using the nrows argument. 2) copy one of those CSVs from the zip file and place in a plain old folder and try read_csv() again. Maybe the compression with the zip file is causing problems with read_csv(). 3) Upgrade Pandas. You are a few versions behind. – Bob Haffner Nov 26 '14 at 23:28
  • Cool! Wow, that was a weird one. That skipfooter error might be worth filing. I wonder if you had any CSVs that were less than 4 lines? – Bob Haffner Nov 27 '14 at 00:05
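The clean-up approach discussed in the comments above (read each archive member into memory, drop the last 3 lines that contain the stray `;` and trailer, then parse the cleaned text) might be sketched like this; the archive path is the one from the question, and the UTF-8 encoding and 3-line trailer are assumptions:

```python
import io
import zipfile

import pandas as pd

def read_zipped_csvs(zip_path, skip_tail=3):
    """Read every CSV member of a zip archive, dropping the last
    `skip_tail` lines of each file (which hold the stray ';' trailer)."""
    container = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Read the whole member into memory and strip the trailer lines
            text = zf.read(name).decode('utf-8')
            cleaned = '\n'.join(text.splitlines()[:-skip_tail])
            container[name] = pd.read_csv(io.StringIO(cleaned))
    return container

# container = read_zipped_csvs(r'C:\Users\Austen\Anaconda\cs109_final\CA34.zip')
```

Since the trailer lines never reach the parser, neither the `;` nor the unbalanced quoting can trigger the tokenizing error.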

2 Answers


Would adding an option to read_csv fix the problem? I had a similar problem and it was fixed by adding the option quoting=csv.QUOTE_NONE

For example:

import csv

df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

The second comment in this discussion talks about why: https://github.com/pydata/pandas/issues/5500
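A minimal sketch of why this helps: a field with an unbalanced `"` normally makes the C parser fail with exactly the "EOF inside string" error from the question, and `QUOTE_NONE` disables quote handling entirely (the sample data here is made up for illustration):

```python
import csv
import io

import pandas as pd

# The unbalanced '"' in the second line would normally make the C parser
# raise "EOF inside string"; QUOTE_NONE tells it never to treat '"' specially.
broken = 'a,b\n1,"unclosed\n2,3\n'
df = pd.read_csv(io.StringIO(broken), quoting=csv.QUOTE_NONE)
```

The trade-off is that quotes are kept as literal characters in the data, so any fields that legitimately contain the delimiter inside quotes will be split incorrectly.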

Selah

Passing engine="python" solves the issue.

Reference: Most frequent errors
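This also explains the 'unexpected end of data' the asker hit: `skipfooter` is unsupported by the C engine, so it must be combined with the Python engine explicitly. A sketch with made-up inline data standing in for one of the zipped CSVs:

```python
import io

import pandas as pd

# skipfooter only works with engine='python'; the trailing ';' and junk
# lines are dropped before parsing, so they can no longer cause errors.
data = 'x,y\n1,2\n3,4\n;\nfoo\nbar\n'
df = pd.read_csv(io.StringIO(data), skipfooter=3, engine='python')
```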

Kondalarao V