22

A lot of questions have already been asked about this topic on SO (and elsewhere). Among the numerous answers, none has really helped me so far. If I missed a useful one, please let me know.

I would simply like to read a CSV file with pandas into a dataframe. It sounds like a simple task.

My file Test.csv

1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4

My code:

import pandas as pd
df = pd.read_csv('Test.csv',header=None)

My error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6

My guess about the issue: pandas looks at the first line and expects the same number of tokens in every following row. If that is not the case, it stops with an error.

The numerous answers suggest options such as error_bad_lines=False, header=None, or skiprows=3, plus more suggestions that did not help in my case.

However, I don't want to ignore or skip any lines, and I don't know in advance how many columns and rows the data file has.

So it basically boils down to how to find the maximum number of columns in the data file. Is this the way to go? I had hoped there was an easy way to read a CSV file whose first line does not contain the maximum number of columns. Thank you for any hints. I'm using Python 3.6.3 and pandas 0.24.1 on Win7.
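Update: a minimal sketch of that max-column idea, assuming the file can simply be read twice (the names=range(...) call is a common trick to make pandas pad short rows with NaN):

import csv
import pandas as pd

# first pass: find the maximum number of fields in any row
with open('Test.csv', newline='') as f:
    max_cols = max(len(row) for row in csv.reader(f))

# second pass: supply explicit column names so ragged rows are accepted
df = pd.read_csv('Test.csv', header=None, names=range(max_cols))
print(df)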

theozh
  • Seems to be the same issue someone had yesterday: https://stackoverflow.com/questions/55129640/read-csv-into-a-dataframe-with-varying-row-lengths-using-pandas. Either read in full lines and split after, alter the original file adding a header row with an excessive amount of columns just to be safe, or use the `csv` module – ALollz Mar 15 '19 at 18:14
  • thanks for the link. I will check. I don't want to change the datafile, I simply want to read it. – theozh Mar 15 '19 at 18:18
  • Yes, the upvoted answers in that SO link should help solve this problem. I like the solution with `pd.read_fwf('path_to_csv', header=None)` as it just needs the csv path. – amanb Mar 15 '19 at 18:21
  • the `fwf` solution is very case specific and can break very easily since the underlying data isn't actually fixed-width delimited. – ALollz Mar 15 '19 at 18:31
  • @amanb, in my case `df = pd.read_fwf('Test.csv', delimiter=',', header=None)` and `print(df)` does not show the expected result. – theozh Mar 15 '19 at 18:32
  • Then, you may try other methods – amanb Mar 15 '19 at 18:34
  • The reason why the pandas csv support seems lacking is that the input file is not a csv. It looks like one, but csv records are expected to have the same sequence of fields per record, and the one above does not. Therefore, don't be shy about pre-processing the input data to obtain a csv. https://en.wikipedia.org/wiki/Comma-separated_values#Specification – Rich Andrews Mar 15 '19 at 20:57

6 Answers

29

Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.

So, the way it works for me for the time being is adapted from @ALollz' compact solution (https://stackoverflow.com/a/55129746/7295599)

### reading an "incorrect" CSV to dataframe having a variable number of columns/tokens 
import pandas as pd

df = pd.read_csv('Test.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
# ... do some modifications with df
### end of code

df contains the empty string '' for entries missing at the beginning or in the middle of a row, and None for tokens missing at the end of a row.

   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None

If you write this back to a file via:

df.to_csv("Test.tab",sep="\t",header=False,index=False)

1   2   3   4   5       
1   2   3   4   5   6   
        3   4   5       
1   2   3   4   5   6   7
    2       4           

None will be converted to the empty string '' and everything is fine.
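Note: some newer pandas versions reject sep='\n' as a separator. In that case, a minimal sketch of an equivalent read, doing the line splitting by hand (assuming plain comma-separated lines without quoting):

import pandas as pd

# read the raw lines and split on the comma ourselves
with open('Test.csv') as f:
    rows = [line.rstrip('\n').split(',') for line in f]

df = pd.DataFrame(rows)  # short rows are padded with None, as above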

The next level would be to account for data strings in quotes which contain the separator, but that's another topic. For example:

1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
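A minimal sketch of that next level, using the stdlib csv module (the quotechar/skipinitialspace approach from the comments below; 'Quoted.csv' is a hypothetical file containing the three lines above):

import csv
import pandas as pd

# csv.reader honors quotes, so "Hello, World!" stays a single token
with open('Quoted.csv', newline='') as f:
    rows = list(csv.reader(f, quotechar='"', skipinitialspace=True))

df = pd.DataFrame(rows)  # rows of differing length are padded with None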
theozh
  • actually how would you account for data strings that contain the separator? – Apostolis Kennedy Jun 04 '20 at 14:28
  • I cannot tell right away, I haven't tested. Have you? But there is probably a way with a "smart" `split()` command, maybe with regular expressions... – theozh Jun 04 '20 at 14:54
  • I found a way in another post, where I read the csv with csv.reader, which has some parameters. I used 2 of them: quotechar = '"' (it's a double quote inside single quotes, or you can do it the other way around if needed) and skipinitialspace = True. I am not 100% sure if skipinitialspace was needed in my csv, but hey, it works... – Apostolis Kennedy Jun 05 '20 at 11:46
3

Read the csv using the tolerant python csv module, and fix the loaded data prior to handing it off to pandas, which would otherwise fail on the malformed csv data regardless of the csv engine pandas uses.

import pandas as pd
import csv

not_csv = """1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
"""

with open('not_a.csv', 'w') as csvfile:
    csvfile.write(not_csv)

d = []
with open('not_a.csv') as csvfile:
    areader = csv.reader(csvfile)
    # first pass: find the maximum number of fields in any row
    max_elems = 0
    for row in areader:
        if max_elems < len(row):
            max_elems = len(row)
    csvfile.seek(0)
    # second pass: fix the csv by padding every row to max_elems fields
    for row in areader:
        d.append(row + [""] * (max_elems - len(row)))

df = pd.DataFrame(d)
print(df)

# the default engine
# provides "pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6 "
#df = pd.read_csv('Test.csv',header=None, engine='c')

# the python csv engine
# provides "pandas.errors.ParserError: Expected 6 fields in line 4, saw 7 "
#df = pd.read_csv('Test.csv',header=None, engine='python')

Alternatively, preprocess the file outside of Python if you are concerned about adding too much preprocessing code to your program. awk is an excellent tool for this:

Richs-MBP:tmp randrews$ cat test.csv
1,2,3
1,
2
1,2,
,,,
Richs-MBP:tmp randrews$ awk 'BEGIN {FS=","}; {print $1","$2","$3","$4","$5}' < test.csv
1,2,3,,
1,,,,
2,,,,
1,2,,,
,,,,
Rich Andrews
  • Thank you for your suggestion and for pointing out that my example actually is not "strictly correct" CSV data. I prefer to keep it short and, besides pandas, to avoid another module which would probably blow up my final executable further. – theozh Mar 16 '19 at 07:45
  • Also consider processing the file outside of python if concerned about code bulk, separation of concerns, making someone else do the preprocessing, etc. - 'awk' is an excellent tool for doing what you need. I'll append it to my answer. – Rich Andrews Mar 17 '19 at 14:53
  • thank you for the addition. Unless absolutely necessary, I prefer to use the smallest possible number of tools, modules, external programs, preprocessing steps, etc. I haven't used awk, but have heard of it several times. – theozh Mar 19 '19 at 16:56
1

I have a different take on the solution. Let pandas take care of creating the table and cleaning up missing values, and let us take care of writing a proper tokenizer.

Tokenizer

def tokenize(line):
    # positions of all double quotes in the line
    idx = [x for x, v in enumerate(line) if v == '\"']
    # an unmatched trailing quote is ignored
    if len(idx) % 2 != 0:
        idx = idx[:-1]
    memory = {}
    for i in range(0, len(idx), 2):
        # the quoted substring, including its quotes
        val = line[idx[i]:idx[i+1]+1]
        # placeholder of the same length (for up to 5 quoted fields),
        # so the remaining quote indices stay valid after the replace
        key = "_"*(len(val)-1)+"{0}".format(i)
        memory[key] = val
        line = line.replace(memory[key], key, 1)
    # split on commas, then restore the quoted substrings
    return [memory.get(token, token) for token in line.split(",")]

Test cases for Tokenizer

print(tokenize("1,2,3,4,5"))
print(tokenize(",,3,\"Hello, World!\",5,6"))
print(tokenize(",,3,\"Hello,,,, World!\",5,6"))
print(tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello, World!\",5,6"))
print(tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello,,5,6"))

Output

['1', '2', '3', '4', '5']
['', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello,,,, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello', '', '5', '6']

Putting the tokenizer into action

with open("test1.csv", "r") as fp:
    lines = fp.readlines()

lines = list(map(lambda x: tokenize(x.strip()), lines))
df = pd.DataFrame(lines).replace(np.nan, '')

Advantage:

Now we can tweak the tokenizer function to suit our needs, for example the variant sketched below.
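A minimal sketch of such a tweak (the helper name is mine, not part of the original answer) that strips the surrounding quotes from restored fields:

def tokenize_unquoted(line):
    # same tokenizer, but drop the quotes around quoted fields
    return [t[1:-1] if len(t) >= 2 and t[0] == '"' and t[-1] == '"' else t
            for t in tokenize(line)]

print(tokenize_unquoted(',,3,"Hello, World!",5,6'))
# ['', '', '3', 'Hello, World!', '5', '6']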

mujjiga
  • thank you for this suggestion. I will (re-)consider it when there is an absolute need to cover commas inside quotation marks. – theozh Mar 19 '19 at 17:01
1

In my case:

1. I opened the *.csv in Excel
2. I saved the *.csv as CSV (comma-delimited)
3. I loaded the file in Python via:

import pandas as pd
df = pd.read_csv('yourcsvfile.csv', sep=',')

Hope it helps!

Corina Roca
1

For me the solution was to pass the correct separator, `;`, as follows:

pandas.read_csv(path, sep=';')

smoquet
0

For me this was solved by adding `usecols` to the `pd.read_csv()` call:

usecols=['My_Column_1','My_Column_2',...]
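For context, a minimal sketch of the full call (file name and column names are placeholders):

import pandas as pd

# only the listed columns are read; the names must exist in the header row
df = pd.read_csv('yourcsvfile.csv', usecols=['My_Column_1', 'My_Column_2'])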

sharon