I have looked here and here. Both come close to the core problem I believe I'm seeing, but they get fixed in other ways.
I am trying to parse a CSV in which one field now needs to contain a comma, so we have to wrap that field in quotes. It is the only quoted field.
Our delimiter (sep) is a comma, and we are now adding a double quote as the string delimiter (quotechar).
I've boiled it down to the example below. It seems to me that the order in which sep and quotechar are applied is the key problem: any line where a quoted field contains the sep character never parses.
Data file, with the last line commented out:
$ cat simple.csv
column1,column2, column3
one, two, three
one, two, "three"
#one, "two, two_again", three
$
Code:
import pandas as pd
simple_file = 'simple.csv'  # the file shown above
df = pd.read_csv(simple_file, sep=',', header=0, comment='#', quotechar='"')
print(df)
Output:
column1 column2 column3
0 one two three
1 one two "three"
Now uncomment the last line, which has the sep character inside the quoted string.
Data file:
$ cat simple.csv
column1,column2, column3
one, two, three
one, two, "three"
one, "two, two_again", three
$
Output (fails):
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4
I believe I want to force Pandas to apply the quote character on each line first and only then split on the separator, since it appears to be doing the opposite. I can't figure out how. Is there a way to tell Pandas to do this that I'm missing?
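
To help narrow this down, here is a minimal sketch that feeds the failing row to Python's standard csv module instead of pandas, so it is an analogy and not a statement about pandas internals. It suggests that the space after each comma, rather than the order in which sep and quotechar are applied, may be what stops the quote from being recognized. The variable names are illustrative only.

```python
import csv

# Same row as the failing line in simple.csv; note the space after each comma.
line = 'one, "two, two_again", three'

# Default settings: a space precedes the quote, so the quote is not the first
# character of the field. It is treated as ordinary data and the embedded
# comma splits the field in two.
default_fields = next(csv.reader([line], delimiter=',', quotechar='"'))
print(len(default_fields), default_fields)

# skipinitialspace=True drops whitespace that immediately follows a delimiter,
# so the quote now opens the field and the embedded comma is kept inside it.
stripped_fields = next(
    csv.reader([line], delimiter=',', quotechar='"', skipinitialspace=True)
)
print(len(stripped_fields), stripped_fields)
```

The first call yields four fields and the second yields three. pandas.read_csv also accepts a skipinitialspace parameter, so passing skipinitialspace=True may be worth trying on the failing file. That is untested against the pandas version in the traceback above.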