
I am trying to read in 50 CSV files from a zip file but keep getting

CParserError: Error tokenizing data. C error: EOF inside string starting at line 166

I know there is an error with reading a particular string within the data and can fix it manually, but I don't want to have to extract all the CSV files manually to fix each one.

import zipfile

import pandas as pd

container = {}
# Raw string so the backslashes in the Windows path aren't treated as escapes
with zipfile.ZipFile(r'C:\Users\Austen\Anaconda\cs109_final\CA34.zip') as zf:
    for name in zf.namelist():
        container[name] = pd.read_csv(zf.open(name))

The problem I found is that there is a single `;` near the end of each CSV file. How would I ignore that?

With reference from:

https://github.com/pydata/pandas/issues/5500

I tried to add

    container[name] = pd.read_csv(zf.open(name), skipfooter=4)

but I get 'unexpected end of data'.

Austen Novis
  • I seem to recall someone using read_table to get around this. Worth a shot imo – Bob Haffner Nov 26 '14 at 21:24
  • I get the same error when I use pd.read_table – Austen Novis Nov 26 '14 at 21:25
  • If the `;` character occurs in a predictable or detectable location, you could first manually read in each archive member file, clean it up, and then pass the sanitized version on to `pd.read_csv()`. If the uncompressed CSV member files aren't too big, you could do all this processing in memory very quickly. – martineau Nov 26 '14 at 21:32
  • It is always in the third-to-last line, but the size of each file varies. Is there a way to search the file, delete that line, and then read the updated file into a dataframe? I don't want to use the last 3 lines in the dataframe anyway. – Austen Novis Nov 26 '14 at 21:35
  • Are there possibly other `;` characters in the files you'd want to preserve? Also it's unclear what a `;` has to do with an EOF character. – martineau Nov 26 '14 at 21:40
  • It is just that one towards the end of the file. How would I read the files in and delete the last 3 lines and then save it as a dataframe? – Austen Novis Nov 26 '14 at 21:41
  • To ignore the last 3 lines of a file, you'd first have to determine how many lines were in the file, and then read it a second time from the beginning and stop when you get to the third-from-the-end line. – martineau Nov 26 '14 at 21:45
  • @AustenNovis I just tried to recreate this and couldn't. I'm using 0.15.1. What version are you using? – Bob Haffner Nov 26 '14 at 21:47
  • Using '0.12.0'. Dataframes are from the pandas library. Is the best way to get length by summing every row? – Austen Novis Nov 26 '14 at 21:48
  • @AustenNovis Ok, upgrading might help. Also, I just read some of your other comments regarding deleting the last 3 lines. You can use the skipfooter argument in read_csv() skipfooter : int, default 0 Number of lines at bottom of file to skip (Unsupported with engine=’c’) – Bob Haffner Nov 26 '14 at 22:24
  • with skipfooter I get the error 'unexpected end of data' – Austen Novis Nov 26 '14 at 22:29
  • yikes.. I see you went through the trouble of finding the row count. Instead of skiprows maybe try nrows. nrows : int, default None Number of rows of file to read. Useful for reading pieces of large files – Bob Haffner Nov 26 '14 at 22:40
  • Sorry, I'm not familiar with the pandas library. However, I have an idea about how to clean up the files without having to read through each of them twice. The result could be written to a temp file, which could then be passed on to `pd.read_csv()`. LMK if you think something like that would be an acceptable solution and I'll give it a shot and post an answer. – martineau Nov 26 '14 at 23:13
  • Some parting thoughts on this 1) try using the nrows argument. 2) copy one of those CSVs from the zip file and place in a plain old folder and try read_csv() again. Maybe the compression with the zip file is causing problems with read_csv(). 3) Upgrade Pandas. You are a few versions behind. – Bob Haffner Nov 26 '14 at 23:28
  • Cool! Wow, that was a weird one. That skipfooter error might be worth filing. I wonder if you had any CSVs that were less than 4 lines? – Bob Haffner Nov 27 '14 at 00:05
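The clean-up approach discussed in the comments above (read each archive member into memory, drop the last 3 lines that contain the stray `;` and trailer, then parse the cleaned text) might be sketched like this; the archive path is the one from the question, and the UTF-8 encoding and 3-line trailer are assumptions:

```python
import io
import zipfile

import pandas as pd

def read_zipped_csvs(zip_path, skip_tail=3):
    """Read every CSV member of a zip archive, dropping the last
    `skip_tail` lines of each file (which hold the stray ';' trailer)."""
    container = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Read the whole member into memory and strip the trailer lines
            text = zf.read(name).decode('utf-8')
            cleaned = '\n'.join(text.splitlines()[:-skip_tail])
            container[name] = pd.read_csv(io.StringIO(cleaned))
    return container

# container = read_zipped_csvs(r'C:\Users\Austen\Anaconda\cs109_final\CA34.zip')
```

Since the trailer lines never reach the parser, neither the `;` nor the unbalanced quoting can trigger the tokenizing error.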

2 Answers


Would adding an option to read_csv fix the problem? I had a similar problem and it was fixed by adding the option quoting=csv.QUOTE_NONE

For example:

import csv

df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

The second comment in this discussion talks about why: https://github.com/pydata/pandas/issues/5500
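A minimal sketch of why this helps: a field with an unbalanced `"` normally makes the C parser fail with exactly the "EOF inside string" error from the question, and `QUOTE_NONE` disables quote handling entirely (the sample data here is made up for illustration):

```python
import csv
import io

import pandas as pd

# The unbalanced '"' in the second line would normally make the C parser
# raise "EOF inside string"; QUOTE_NONE tells it never to treat '"' specially.
broken = 'a,b\n1,"unclosed\n2,3\n'
df = pd.read_csv(io.StringIO(broken), quoting=csv.QUOTE_NONE)
```

The trade-off is that quotes are kept as literal characters in the data, so any fields that legitimately contain the delimiter inside quotes will be split incorrectly.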

Selah

Passing engine="python" solves the issue.

Reference: Most frequent errors
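This also explains the 'unexpected end of data' the asker hit: `skipfooter` is unsupported by the C engine, so it must be combined with the Python engine explicitly. A sketch with made-up inline data standing in for one of the zipped CSVs:

```python
import io

import pandas as pd

# skipfooter only works with engine='python'; the trailing ';' and junk
# lines are dropped before parsing, so they can no longer cause errors.
data = 'x,y\n1,2\n3,4\n;\nfoo\nbar\n'
df = pd.read_csv(io.StringIO(data), skipfooter=3, engine='python')
```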

Kondalarao V