
Using Python 3, pandas 0.12

I'm trying to write multiple csv files (7.9 GB in total) to an HDF5 store to process later. The csv files each contain around a million rows and 15 columns; the data types are mostly strings, but some floats. However, when I try to read the csv files I get the following error:

Traceback (most recent call last):
  File "filter-1.py", line 38, in <module>
    to_hdf()
  File "filter-1.py", line 31, in to_hdf
    for chunk in reader:
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
  File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done 

Edit:

I managed to find a file that produced this problem. I think it's reading an EOF character, but I have no clue how to overcome this. Given the large size of the combined files, I think it's too cumbersome to check every single character in each string. (Even then I would still not be sure what to do.) As far as I can tell, there are no strange characters in the csv files that could raise the error. I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.

My code is the following:

# -*- coding: utf-8 -*-

import pandas as pd
import os
from glob import glob


def list_files(path=os.getcwd()):
    ''' List all csv files matching the pattern in the specified path '''
    return glob(os.path.join(path, '2013-06*.csv'))


def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)

    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)

    # Return store
    return store.select('ta_data')

to_hdf()

Edit

If I go into the CSV file that raises the CParserError EOF... and manually delete all rows after the line that is causing the problem, the csv file is read properly. However, all I'm deleting are blank rows anyway. The weird thing is that when I manually correct the erroneous csv files, they are loaded into the store fine individually. But when I again use a list of multiple files, the 'false' files still return errors.

Matthijs
  • don't pass the ``mode='w'``; you are truncating the hdf file on each iteration – Jeff Aug 02 '13 at 12:33
  • you can try catching the CParserError and just skip that file (until you fix it) – Jeff Aug 02 '13 at 12:34
  • Hi Jeff, how do you suggest I catch the CParserError? It's way too cumbersome to check each of the individual files. – Matthijs Aug 02 '13 at 12:45
  • 1
    first figure out which file it is, don't check, just catch: ``from pandas.io import parser; try: your read_csv look for file f except (parser.CParserError) as detail: print f, detail`` – Jeff Aug 02 '13 at 12:48
  • Sorry I don't quite catch your code - I'm rather new to python/pandas. Could you explain a bit further please? – Matthijs Aug 02 '13 at 12:55

11 Answers

---

I had a similar problem. The line listed in the 'EOF inside string' error contained a string with a single quote mark (') in it. When I added the option quoting=csv.QUOTE_NONE, that fixed my problem.

For example:

import csv
import pandas as pd

df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
Selah
---

I had the same problem, and after adding these two parameters to my code, the problem was gone:

read_csv(..., quoting=3, error_bad_lines=False)
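For context, quoting=3 is the numeric value of csv.QUOTE_NONE, so this is essentially the same fix as in the accepted answer. A fuller call might look like this (the file name is illustrative):

import pandas as pd

# quoting=3 is csv.QUOTE_NONE; error_bad_lines=False skips rows
# that the parser cannot tokenize instead of raising.
df = pd.read_csv("data.csv", quoting=3, error_bad_lines=False)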

weefwefwqg3
  • This works like a charm. There was an error in one line; after executing with the above option I got the message `Skipping line 192: expected 5 fields, saw 74` – Ayush Vatsyayan Mar 06 '18 at 08:06
  • This one made me skip too many rows, while engine="python", error_bad_lines=False made me skip only one – Vikranth Mar 01 '22 at 19:43
---

I realize this is an old question, but I wanted to share some more details on the root cause of this error and why the solution from @Selah works.

From the csv.py docstring:

    * quoting - controls when quotes should be generated by the writer.
    It can take on any of the following module constants:

    csv.QUOTE_MINIMAL means only when required, for example, when a
        field contains either the quotechar or the delimiter
    csv.QUOTE_ALL means that quotes are always placed around fields.
    csv.QUOTE_NONNUMERIC means that quotes are always placed around
        fields which do not parse as integers or floating point
        numbers.
    csv.QUOTE_NONE means that quotes are never placed around fields.

csv.QUOTE_MINIMAL is the default value and " is the default quotechar. If somewhere in your csv file there is a quotechar, it will be parsed as a string until another occurrence of the quotechar. If your file has an odd number of quotechars, the last one will not be closed before reaching the EOF (end of file). Also be aware that anything between the quotechars is parsed as a single string. Even if there are many line breaks (which you would expect to be parsed as separate rows), it all goes into a single field of the table. So the line number that you get in the error can be misleading. To illustrate, consider this example:

In[4]: import pandas as pd
  ...: from io import StringIO
  ...: test_csv = '''a,b,c
  ...: "d,e,f
  ...: g,h,i
  ...: "m,n,o
  ...: p,q,r
  ...: s,t,u
  ...: '''
  ...: 
In[5]: test = StringIO(test_csv)
In[6]: pd.read_csv(test)
Out[6]: 
                 a  b  c
0  d,e,f\ng,h,i\nm  n  o
1                p  q  r
2                s  t  u
In[7]: test_csv_2 = '''a,b,c
  ...: "d,e,f
  ...: g,h,i
  ...: "m,n,o
  ...: "p,q,r
  ...: s,t,u
  ...: '''
  ...: test_2 = StringIO(test_csv_2)
  ...: 
In[8]: pd.read_csv(test_2)
Traceback (most recent call last):
...
...
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2

The first string has 2 (an even number of) quotechars. So each quotechar is closed and the csv is parsed without an error, although probably not as we expected. The second string has 3 (an odd number of) quotechars. The last one is not closed and the EOF is reached, hence the error. But the line 2 that we get in the error message is misleading: we would expect 4, but since everything between the first and second quotechar is parsed as a string, our "p,q,r line is actually the second.
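To tie this back to the accepted answer, here is a quick sketch showing that quoting=csv.QUOTE_NONE lets the second example parse, because the stray quotechars are kept as literal text instead of opening quoted fields:

import csv
import pandas as pd
from io import StringIO

# The same failing input as test_csv_2 above
test_csv_2 = 'a,b,c\n"d,e,f\ng,h,i\n"m,n,o\n"p,q,r\ns,t,u\n'

# With quoting disabled this parses into five rows of three fields each;
# the leading " characters simply stay in the field values.
print(pd.read_csv(StringIO(test_csv_2), quoting=csv.QUOTE_NONE))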

MJB
---

Making your inner loop like this will allow you to detect the 'bad' file (and investigate further):

from pandas.io import parser

def to_hdf():

    .....

    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        try:
            reader = pd.read_csv(f, chunksize=50000)

            # Looping over chunks and storing them in store file, node name 'ta_data'
            for chunk in reader:
                chunk.to_hdf(store, 'ta_data', table=True)

        except (parser.CParserError) as detail:
            print(f, detail)
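For reference, on newer pandas (0.20 and later) the exception lives at pandas.errors.ParserError, and chunked writes are usually done with HDFStore.append. A sketch under those assumptions, reusing list_files() from the question:

import pandas as pd
from pandas.errors import ParserError  # public name for CParserError in pandas >= 0.20

store = pd.HDFStore('ta_store.h5')
for f in list_files():
    try:
        for chunk in pd.read_csv(f, chunksize=50000):
            store.append('ta_data', chunk)  # appends to the table node instead of overwriting
    except ParserError as detail:
        print(f, detail)  # report the bad file and keep going
store.close()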
Jeff
  • Hi Jeff, thanks! It works and I did find out which files/lines are causing the problem. Now I can try to 'correct' those files manually, but I would rather have a more programmatic solution. Thus I need to understand what the error I'm getting actually is, and what kind of code I should write to automatically take care of the problem. – Matthijs Aug 02 '13 at 14:46
  • you could try specifying a ``lineterminator`` (which is essentially ``\n`` on linux, or ``\n\r`` on windows I think). At worst you get a bad line (as the invalid terminator is put in the next line), but you need to see what's wrong in the first place: http://pandas.pydata.org/pandas-docs/dev/io.html#csv-text-files – Jeff Aug 02 '13 at 15:01
  • The weird thing is that when I manually correct the erroneous csv files, they are loaded fine into the store individually. But when I again use `glob` to read a bunch of files these files still return me errors. – Matthijs Aug 02 '13 at 15:01
  • that is weird about ``glob``; I personally use something like ``for f in os.listdir(dir): if is_ok(f): process_file(f)``, where ``is_ok`` is a function to accept/reject the filename (or it could be other criteria, or a ``re.search``) – Jeff Aug 02 '13 at 15:18
  • Thanks Jeff, I will try something like that. Also, I could not specify the lineterminator to be `\n\r`; I received an error message that only 1-character lineterminators can be used (as also raised here: https://github.com/pydata/pandas/issues/3501). – Matthijs Aug 03 '13 at 08:31
  • that's by definition; you could simply substitute the line ending in your file with some other character; your file is corrupt somehow – Jeff Aug 03 '13 at 11:28
  • on a side note, I think the first line of code should be `from pandas import parser` instead of `from pandas.io import parser`? The latter does not work with my pandas 0.15.0 – Yulong Jan 07 '15 at 20:50
---

The solution is to use the parameter engine='python' in the read_csv function. The pandas CSV parser can use two different "engines" to parse a CSV file – Python or C (C is the default).

pandas.read_csv(filepath, sep=',', delimiter=None, 
            header='infer', names=None, 
            index_col=None, usecols=None, squeeze=False, 
            ..., engine=None, ...)

The Python engine is described as "slower, but more feature complete" in the pandas documentation:

engine : {'c', 'python'}
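A minimal sketch of such a call (the file name is illustrative):

import pandas as pd

# Fall back to the pure-Python parser; slower, but it tolerates some
# files that trip up the default C engine.
df = pd.read_csv("data.csv", engine="python")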
Aman Singh
  • This is the best answer. It is also explained here: https://www.shanelynn.ie/pandas-csv-error-error-tokenizing-data-c-error-eof-inside-string-starting-at-line/ – NellieK Jan 12 '23 at 21:16
---

My error:

ParserError: Error tokenizing data. C error: EOF inside string starting at row 4488

was resolved by adding delimiter="\t" to my code:

import pandas as pd
df = pd.read_csv("filename.csv", delimiter="\t")
gab
---

Use

engine="python",
error_bad_lines=False,

on the read_csv call.

The full call will be like this:

df = pd.read_csv(csvfile, 
                 delimiter="\t", 
                 engine="python",
                 error_bad_lines=False,  
                 encoding='utf-8')
george mano
---

For me, the other solutions did not work and caused me quite a headache. error_bad_lines=False still gave the error C error: EOF inside string starting at line. Using a different quoting didn't give the desired results either, since I did not want to have quotes in my text.

I realised that there was a bug in pandas 0.20. Upgrading to version 0.21 completely solved my issue. For more info about this bug, see: https://github.com/pandas-dev/pandas/issues/16559

Note: this may be Windows-related as mentioned in the URL.
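A quick way to check which version you are running:

import pandas as pd

print(pd.__version__)  # the linked issue is fixed from 0.21 onwards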

Guido
---

After looking for a solution for hours, I finally came up with a workaround.

The best way to eliminate the C error: EOF inside string starting at line exception without losing multiprocessing efficiency is to preprocess the input data (if you have such an opportunity).

Replace all of the '\n' entries in the input file with, for instance, ', ', or with any other unique symbol sequence (for example, 'aghr21*&'). Then you will be able to read_csv the data into your dataframe.

After you have read the data, you may want to replace all of your unique symbol sequences ('aghr21*&') back with '\n'; see the sketch below.
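A minimal sketch of the restore step, assuming recent pandas, the placeholder token from above, and an illustrative file name:

import pandas as pd

TOKEN = 'aghr21*&'  # unique placeholder substituted for '\n' during preprocessing

df = pd.read_csv('preprocessed.csv')  # input with embedded newlines already replaced

# Put the original newlines back in every string column; regex=False makes the
# replacement literal, so the '*' in the token is not treated as a regex metacharacter.
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.replace(TOKEN, '\n', regex=False)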

---

I had a similar issue while trying to pull data from a GitHub repository. Simple mistake: I was trying to pull data from the git blob (the HTML-rendered page) instead of the raw csv.

If you're pulling data from a git repo, make sure your link doesn't include <repo name>/blob unless you're specifically interested in the HTML code from the repo.
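For illustration, here is the difference between the two URL shapes (the placeholders do not point at a real repository):

import pandas as pd

# HTML page -- read_csv would choke on this:
#   https://github.com/<user>/<repo>/blob/main/data.csv
# Raw file -- this is what read_csv needs:
url = "https://raw.githubusercontent.com/<user>/<repo>/main/data.csv"
df = pd.read_csv(url)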

DonCarleone
---

The easiest solution that involves no data loss is simply:

df = pd.read_csv(nome_do_arquivo, sep=",")
Azo