Pandas read_csv with double quotes and separator simultaneously

Question

I've seen different variations of this question, but none seemed to be exactly what I need.

I have a CSV file with 4 columns, with double quotes enclosing every field, as in the sample below:

"user_id","artistname","trackname","playlistname"
"7511e45f2cc6f6e609ae46c15506538c","Glenn Gould",""Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino","Instrumenal - Home Listens"

As can be seen in the example above, the trackname field contains both a quoted word and a comma character, and neither is escaped. I expect to get the following result:

	user_id	artistname	trackname	playlistname
0	7511e45f2cc6f6e609ae46c15506538c	Glenn Gould	"Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino	Instrumenal - Home Listens

But I'm receiving a CSV error when the example line is reached: Error tokenizing data. C error: Expected 4 fields in line 14735, saw 5

This is how I'm reading the file:

import csv

import pandas as pd

df = pd.read_csv(
    'path_to/my_file.csv',
    sep=',',
    quoting=csv.QUOTE_ALL,
    quotechar='"',
    doublequote=False
)

Is there a way to read this file without preprocessing it? If not, what preprocessing should be done, noting that this is a huge file?

You'll have to preprocess somehow... that's not valid CSV... To have a quote in a field, it needs to be escaped with a quote... so you really should have `"""Kyllikki""..."` (and you'd want `doublequote=True`)... in fact... all of those options you're supplying are either the default or not what you want... you can literally just use `pd.read_csv(filename)` here — Jon Clements, Apr 02 '22 at 19:44
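
A minimal sketch of one way to do that preprocessing in a single streaming pass, so the whole file never has to sit in memory. It rests on assumptions the question does not confirm: every field is wrapped in quotes, the three-character sequence `","` only ever marks a field boundary, and no field contains a line break. The helper names `split_quoted_record` and `repair_csv` are made up for illustration.

```python
import csv
import io


def split_quoted_record(line, sep=",", quote='"'):
    """Split a record in which every field is wrapped in quotes.

    Assumes quote+sep+quote only ever marks a field boundary and that
    no field contains a line break.
    """
    body = line.rstrip("\r\n")
    if len(body) < 2 or body[0] != quote or body[-1] != quote:
        raise ValueError(f"record is not fully quoted: {body!r}")
    # Drop the outer quotes, then split on quote-separator-quote.
    return body[1:-1].split(quote + sep + quote)


def repair_csv(src, dst):
    """Stream records from src to dst, re-quoting each one as valid CSV."""
    # csv.writer doubles embedded quotes by default (doublequote=True).
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in src:
        if line.strip():  # skip blank lines
            writer.writerow(split_quoted_record(line))


# Small in-memory demonstration using the record from the question.
sample = io.StringIO(
    '"user_id","artistname","trackname","playlistname"\n'
    '"7511e45f2cc6f6e609ae46c15506538c","Glenn Gould",'
    '""Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino",'
    '"Instrumenal - Home Listens"\n'
)
repaired = io.StringIO()
repair_csv(sample, repaired)
print(repaired.getvalue())
```

For a real file, the same function could be given two handles opened with `newline=""` (one for reading, one for writing), after which a plain `pd.read_csv` on the repaired file should be enough. Records that break the assumptions above would still need a manual look.
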
I'd probably start off trying to ascertain how big the problem is... before doing your `pd.read_csv` create an empty list `lines_to_check = []` and then add `on_bad_lines=lines_to_check.append` to your `pd.read_csv(...)`... that'll get the valid rows in your DF and then you can look at `lines_to_check` to see what the damage is... — Jon Clements, Apr 02 '22 at 19:48

Michel · Answer 1 · 2023-02-23T08:21:07.643

1

I had the same issue; changing the parsing engine solved the problem. Just add engine='python' to the pd.read_csv() call.

See https://stackoverflow.com/a/43586874/21271392

edited Feb 23 '23 at 08:21

answered Feb 23 '23 at 08:19
