I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1,106,012 rows and 171 columns. Smaller files are read successfully without any error, but this 1 GB file fails every time on exactly line 1106011, which is the second-to-last line of the file. I can remove that line manually, but that is not a solution, because I have hundreds of other files of the same size and cannot fix them all by hand. Can anyone help me with this, please?

import pandas as pd

def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    # Read one chunk: skip the first `eachRow` lines, then parse `size_of_chunk` rows.
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,
                     sep=',')
                     # engine='python'
                     # quoting=csv.QUOTE_NONE
                     # encoding='utf-8'

    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    # Convert every column to lowercase strings.
    df = df.apply(lambda x: x.astype(str).str.lower())

    return df

I then call this function in a loop, which works just fine:

huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)

I have read "Pandas ParserError EOF character when reading multiple csv files to HDF5", "read_csv() & EOF character in string cause parsing issue", https://github.com/pandas-dev/pandas/issues/11654 and many more, and tried adding read_csv parameters such as

engine='python'

quoting=csv.QUOTE_NONE  # hangs, and even the Python shell hangs; I don't know why

encoding='utf-8'

but none of them worked. It still throws the following error.

Error:

Traceback (most recent call last):
  File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
    huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
  File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
    sep=',')
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
  File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
  File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
  File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
>>> 
Wcan
  • Can you show us a valid row and the invalid row (the second-to-last one, which you removed)? – Indent Oct 19 '17 at 09:00
  • I cannot paste it here because it has 171 columns. It looks like a normal row, but when pandas reads it, it throws the above-mentioned error on the second-to-last line of the file. – Wcan Oct 19 '17 at 09:08
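
When the row is too wide to paste, it can still be inspected locally. "EOF inside string" means the parser reached the end of the file while it was still inside a quoted field, so an unmatched quote character on or before the reported line is a likely suspect. The following is only a rough sketch under stated assumptions: the quote character is the default `"`, the file is UTF-8 text, and the function name `find_unbalanced_quote_lines` is made up for illustration. It lists every physical line that contains an odd number of quote characters.

```python
def find_unbalanced_quote_lines(path, quotechar='"', encoding="utf-8"):
    """Return (line_number, quote_count) for lines with an odd number of quotes.

    Heuristic only: a quoted field that legitimately spans several lines
    is reported as well, so treat the result as a list of candidates.
    """
    suspects = []
    # errors="replace" keeps the scan going if the file has undecodable bytes.
    with open(path, "r", encoding=encoding, errors="replace", newline="") as handle:
        for line_number, line in enumerate(handle, start=1):
            count = line.count(quotechar)
            if count % 2 == 1:
                suspects.append((line_number, count))
    return suspects


# Hypothetical usage:
# for line_number, count in find_unbalanced_quote_lines("huge_file.csv"):
#     print(line_number, count)
```

The scan reads one line at a time, so memory use stays small even for a 1 GB file.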

2 Answers

If you are on Linux, try removing all non-printable characters, then load the file again.

tr -dc '[:print:]\n' < file > newfile
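
The traceback in the question shows a Windows machine, where `tr` may not be available. A rough Python equivalent is sketched below. It assumes that, like `tr -dc '[:print:]\n'` in the C locale, only printable ASCII bytes (0x20 to 0x7E) and newlines should be kept. The names `strip_nonprintable` and `clean_file` are made up for illustration.

```python
# Bytes to keep: printable ASCII plus the newline character.
KEEP = bytes(range(0x20, 0x7F)) + b"\n"
# Every other byte value is deleted.
DELETE = bytes(b for b in range(256) if b not in KEEP)


def strip_nonprintable(data: bytes) -> bytes:
    """Drop every byte that is neither printable ASCII nor a newline."""
    return data.translate(None, DELETE)


def clean_file(src, dst, block_size=1 << 20):
    """Copy src to dst in blocks, removing non-printable bytes on the way."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            fout.write(strip_nonprintable(block))


# Hypothetical usage:
# clean_file("file.csv", "newfile.csv")
```

Note that this also removes tabs, carriage returns, and all non-ASCII bytes, including accented characters encoded as UTF-8, so it is only appropriate when the data is expected to be plain ASCII.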
Indent

I tried many solutions. Some of them worked but affected the calculations, so I used this one, which skips the line that is causing the error:

pd.read_csv(file,engine='python', error_bad_lines=False) 

#engine='python' provides a better output
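
On newer pandas versions the call above needs a small change: `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0, and `on_bad_lines="skip"` is its replacement. The following is a minimal sketch, assuming pandas 1.3 or later. The helper name `read_skipping_bad_lines` and the sample data are made up for illustration. It also uses `chunksize` so that the file is parsed in a single pass.

```python
import io

import pandas as pd


def read_skipping_bad_lines(source, chunk_size=10_000):
    """Yield DataFrame chunks, skipping lines that have too many fields."""
    reader = pd.read_csv(source,
                         header=None,
                         engine="python",
                         on_bad_lines="skip",  # replaces error_bad_lines=False
                         chunksize=chunk_size)
    for chunk in reader:
        yield chunk


# Tiny in-memory sample: the second line has one field too many.
sample = io.StringIO("1,2\n3,4,5\n6,7\n")
total_rows = sum(len(chunk) for chunk in read_skipping_bad_lines(sample))
print(total_rows)
```

Skipped lines are dropped silently here, so it is worth comparing the resulting row count against the expected one.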

Benkerroum Mohamed
Carlos Chaccon
  • This also worked for me. Here's another resource that agrees with this answer: https://www.shanelynn.ie/pandas-csv-error-error-tokenizing-data-c-error-eof-inside-string-starting-at-line/ – Brad123 Mar 15 '20 at 21:18