Pandas read_csv() conflict with sep and quotechar causing unexpected number of columns

Question

I have looked here and here; both come close to the core problem I believe I'm seeing, but they get fixed in other ways.

I am trying to parse a CSV file in which one field now needs to contain a comma, so we have to wrap that field in quotes. It is the only quoted field.

Our delimiter (sep) is a comma, and we are now adding a quote character (quotechar) as the string delimiter.

I've boiled it down to this. It seems to me that the order in which sep and quotechar are applied is the key problem: a line that uses quotechar and has a sep inside the quoted string never parses.

Data file with last line commented out.

$ cat simple.csv
column1,column2, column3
one,    two,                three
one,    two,               "three"
#one,    "two, two_again",   three
$

Code:

import pandas as pd
simple_file = "simple.csv"  # the data file shown above
df = pd.read_csv(simple_file, sep=',', header=0, comment='#', quotechar='"')
print(df)

Output:

column1  column2                  column3
0     one      two                    three
1     one      two                 "three"

Now, add the last line which has the sep char in the quoted string.

Data file:

$ cat simple.csv
column1,column2, column3
one,    two,                three
one,    two,               "three"
one,    "two, two_again",   three
$

Output fail:

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4
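
For anyone who wants to reproduce this without a file on disk, here is a minimal sketch. It assumes a reasonably recent pandas, where the exception is exposed as pd.errors.ParserError rather than CParserError; the names CSV_TEXT and try_parse are made up for this example.

```python
import io

import pandas as pd

# Same content as simple.csv above, including the spaces after each comma.
CSV_TEXT = '''column1,column2, column3
one,    two,                three
one,    two,               "three"
one,    "two, two_again",   three
'''


def try_parse(text):
    """Return the DataFrame, or the ParserError if tokenizing fails."""
    try:
        return pd.read_csv(
            io.StringIO(text), sep=",", header=0, comment="#", quotechar='"'
        )
    except pd.errors.ParserError as exc:
        return exc


result = try_parse(CSV_TEXT)
print(type(result).__name__, result)
```

The quote on the last line is not the first character after the comma, so it is treated as ordinary field content and the comma inside it still splits the field.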

I believe I want to force Pandas to apply the quote character on each line first and only then split on the separator, since it appears to be doing the opposite. I can't figure out how. Is there a way to tell Pandas to do this that I haven't found?

Get rid of the spaces or define the spaces as part of the separator. Then the file is readable just by specifying header=None, the defaults take care of the rest. — pvg, Dec 16 '16 at 00:30

score 0 · Accepted Answer · answered Dec 16 '16 at 00:38


The pandas CSV reader gets confused because you told it the separator is strictly ',' but you are also using space as a separator in your data file. Either change the separator or fix the data. With the data as

column1,column2, column3
one,two,three
one,two,"three"
one,"two, two_again",three

You get the following

import pandas as pd
print(pd.read_csv("data.csv", header=None))

         0               1         2
0  column1         column2   column3
1      one             two     three
2      one             two     three
3      one  two, two_again     three
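
If fixing the data is not an option, another route is read_csv's skipinitialspace parameter, which skips spaces that follow the delimiter. This is not part of the answer above; it is a sketch under the assumption that the only stray whitespace comes right after each comma, as in the question's file. CSV_TEXT is a stand-in for the file contents.

```python
import io

import pandas as pd

# The question's original data, spaces and all.
CSV_TEXT = '''column1,column2, column3
one,    two,                three
one,    two,               "three"
one,    "two, two_again",   three
'''

# skipinitialspace=True drops the spaces after each comma, so the opening
# quote becomes the first character of the field and quoting applies.
df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    sep=",",
    header=0,
    comment="#",
    quotechar='"',
    skipinitialspace=True,
)
print(df)
```

This leaves the separator as a plain comma, so the default C parser is still used.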

answered Dec 16 '16 at 00:38

pvg


And it looks like I made it worse when I created this little test data file. I assumed Pandas would be smart enough to ignore the white space around the separator yet I just told it exactly what the separator is. Don't have the original at my fingertips at this moment but with my test case eliminating the spaces makes it work as I expected. – Kevin M Dec 16 '16 at 03:11
I can't edit my own comment. I can't edit my own comment on my own question ? – Kevin M Dec 16 '16 at 03:15
@KevinM Pandas was smart enough to do exactly what you told it. If the separator is only a comma, then the space after the comma is part of the next item. Then suddenly you have a quote in the middle of the item, which makes no sense, and then an extra separator. The parser, sensibly, barfs. You can easily give pandas a regex as a separator, although that means it will use the Python parser rather than the C parser. This is slower, but that may not matter in your case. – pvg Dec 16 '16 at 03:44
Exactly what I told it :-) – Kevin M Dec 16 '16 at 21:16
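
A rough sketch of the regex-separator idea from the comment above, assuming a pattern of optional whitespace around a comma. The pattern and the sample rows are illustrative choices, not something given in the thread. The sample deliberately leaves out the quoted line: the pandas documentation warns that regex delimiters are prone to ignoring quoted data, so this approach may not help with the embedded comma itself.

```python
import io

import pandas as pd

# Only the unquoted rows; a regex separator may split inside quoted fields.
CSV_TEXT = '''column1,column2, column3
one,    two,                three
'''

# A multi-character separator is treated as a regular expression and needs
# the slower Python parsing engine, which is requested explicitly here.
df = pd.read_csv(io.StringIO(CSV_TEXT), sep=r"\s*,\s*", engine="python")
print(df)
```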
