
I have a large tab-separated file that I load as follows:

df = pd.read_csv('my_data.tsv', sep='\t', header=0, skiprows=[1, 2, 3])

I get several errors during the loading process.

  1. First, if I don't specify `warn_bad_lines=True, error_bad_lines=False`, I get:

    Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

  2. Second, if I use the options above, I now get:

    CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585

The question is: how can I have a look at these bad lines to understand what's going on? Is it possible to have `read_csv` return these bogus lines?

I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5):

import pandas as pd
from pandas import parser

try:
    df = pd.read_csv('mydata.tsv', sep='\t', header=0, skiprows=[1, 2, 3])
except parser.CParserError as detail:
    print(detail)

but I still get:

Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

– ℕʘʘḆḽḘ
  • Did you check the first answer in this? Could it be special characters? http://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5 – kosa Aug 11 '16 at 17:37
  • Yep, let me add that to the question. – ℕʘʘḆḽḘ Aug 11 '16 at 17:37
  • What is the data at line 32357585? That may give some clues. Check your pandas version too: https://github.com/pydata/pandas/issues/11654 – kosa Aug 11 '16 at 17:38
  • Yeah, well, that is the problem: how can I read this data line? – ℕʘʘḆḽḘ Aug 11 '16 at 17:41
  • If it is CSV, open it in a CSV viewer (assuming a Windows box), or use some other file-reading API to first print and understand the data. Once you know what is there, you can try to find a workaround for it in pandas. – kosa Aug 11 '16 at 17:43
  • Problem is: the data is too big. I need to use pandas here. There must be a way. – ℕʘʘḆḽḘ Aug 11 '16 at 17:44
  • The answer I was referring to a few comments ago was related to quoting: `quoting=csv.QUOTE_NONE` (not sure if this is applicable in your version). – kosa Aug 11 '16 at 17:44
  • Try with `low_memory=False` in `read_csv`. – Kartik Aug 11 '16 at 18:28
  • I think the `low_memory` option does not do anything actually, don't you think? – ℕʘʘḆḽḘ Aug 11 '16 at 18:29
  • Also, pass `names=range(24)` to force `read_csv` to use 24 columns from the beginning. – Kartik Aug 11 '16 at 18:30
  • To solve the EOF in string, try the quoting options available with `read_csv`. – Kartik Aug 11 '16 at 18:32
  • Pandas reads the first few lines, determines dtypes, and then reads the rest of the data with those dtypes. Sometimes that causes misinterpretation of strings. `low_memory=False` just makes pandas determine the dtypes after reading all of the data, at the cost of duplicating data in memory. But you're right, it probably won't help your situation. How about the other suggestions? – Kartik Aug 11 '16 at 18:36
  • Thanks, let me try them. I still don't know how to actually see these bad lines, though. Isn't that crazy? There should be a way. – ℕʘʘḆḽḘ Aug 11 '16 at 18:37
  • Try using the csv module with try/except, where try does nothing and except prints the bad line. – Merlin Aug 11 '16 at 18:59
  • Good idea. Can you code that up? – ℕʘʘḆḽḘ Aug 11 '16 at 19:00
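The quoting suggestion from those comments can be sketched as follows (a minimal sketch; the in-memory sample with a stray quote is made up to reproduce the "EOF inside string" situation):

```python
import csv
import io
import pandas as pd

# an unclosed quote like the one in row 1 is what makes the C parser
# report "EOF inside string starting at line ..."
sample = 'a\tb\n1\t"oops\n2\t3\n'

# quoting=csv.QUOTE_NONE treats quote characters as ordinary data
df = pd.read_csv(io.StringIO(sample), sep='\t', quoting=csv.QUOTE_NONE)
print(df.shape)  # both data rows parse, quote kept as a literal character
```
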

3 Answers


I'll give my answer in two parts.

Part 1: The OP asked how to output these bad lines. To answer this, we can use the Python csv module in a simple piece of code like this:


import csv

filename = 'your_filename.csv'  # use your filename
lines_set = {100, 200}          # put your bad line numbers here

with open(filename, newline='') as f_obj:
    # enumerate from 1 so the numbers match the parser's error message
    for line_number, row in enumerate(csv.reader(f_obj), start=1):
        if line_number > max(lines_set):
            break
        elif line_number in lines_set:
            print(line_number, row)

We can also put it in a more general function:

import csv


def read_my_lines(filename, lines_list, reader=csv.reader):
    lines_set = set(lines_list)
    with open(filename, newline='') as f_obj:
        for line_number, row in enumerate(reader(f_obj), start=1):
            if line_number > max(lines_set):
                break
            elif line_number in lines_set:
                print(line_number, row)


if __name__ == '__main__':
    read_my_lines(filename='your_filename.csv', lines_list=[100, 200])

Part 2: The cause of the errors you get:

It's hard to diagnose a problem like this without a sample of the file you use, but you should try this:

pd.read_csv(filename)

Does it parse the file with no errors? If so, here is why.

The number of columns is inferred from the first line.

By using skiprows with header=0 you skipped the first 3 rows, which I guess contain the column names or a header row with the correct number of columns.

Basically you are constraining what the parser is doing.

So parse without skiprows or header=0, then reindex to what you need later.
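The related `names=range(24)` suggestion from the comments can be sketched like this (an in-memory ragged sample stands in for the real file, and the widened column count is an assumption taken from the error message):

```python
import io
import pandas as pd

# ragged sample: the first line has fewer fields than a later line,
# which is exactly what triggers "Expected N fields ... saw M"
data = "a\tb\nc\td\te\tf\n1\t2\n"

# header=None plus explicit names widens the table up front,
# so short rows are padded with NaN instead of raising an error
df = pd.read_csv(io.StringIO(data), sep='\t', header=None, names=range(4))
print(df.shape)  # all three rows survive, at the full width
```
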

Note:

If you're unsure about which delimiter is used in the file, use sep=None, but it will be slower.

From the pandas.read_csv docs:

sep : str, default ‘,’ Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'

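For instance (a short sketch; the semicolon-separated sample is made up), the Python engine's sniffer detects the delimiter on its own:

```python
import io
import pandas as pd

sample = "a;b;c\n1;2;3\n4;5;6\n"  # delimiter not known ahead of time

# sep=None forces the Python engine, which sniffs the separator
df = pd.read_csv(io.StringIO(sample), sep=None, engine='python')
print(list(df.columns))  # the ';' delimiter was detected automatically
```
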

– Sameh Farouk

In my case, adding an explicit separator helped:

data = pd.read_csv('/Users/myfile.csv', encoding='cp1251', sep=';')
– Marina

We can get the line number from the error message and print that line to see what it looks like.

Try:

import re
import subprocess

import pandas as pd

try:
    filename = 'mydata.tsv'
    df = pd.read_csv(filename, sep='\t', header=0, skiprows=[1, 2, 3])
except pd.errors.ParserError as detail:  # CParserError in older pandas
    print(detail)
    # grab all the numbers in the message, e.g. ['22', '329867', '24'];
    # the line number is at index 1
    err = re.findall(r'\b\d+\b', str(detail))
    # shell command 'sed -n 329867p filename' prints that line of the file
    line = subprocess.check_output(
        'sed -n %sp %s' % (err[1], filename),
        stderr=subprocess.STDOUT, shell=True)
    print('Bad line:')
    print(line)
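If sed is not available (e.g. on Windows), a pure-Python equivalent can read just the one line; the `get_line` helper below is a hypothetical sketch that mirrors the 1-based line numbers in the error message:

```python
from itertools import islice


def get_line(path, lineno):
    """Return the 1-based line `lineno` of `path`, or None if past EOF."""
    with open(path, 'rb') as f:  # bytes: a bad line may not decode cleanly
        # islice skips lineno - 1 lines, then yields the one we want
        return next(islice(f, lineno - 1, lineno), None)
```

Because the file is only iterated, not loaded, this stays cheap even for very large files.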
– Yugandhar Chaudhari