
I've seen different variations of this question, but none of them seemed to be exactly what I need.

I have a CSV file with 4 columns, with double quotes enclosing every field, as in the sample below:

"user_id","artistname","trackname","playlistname"
"7511e45f2cc6f6e609ae46c15506538c","Glenn Gould",""Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino","Instrumenal - Home Listens"

As can be seen in the example above, the trackname field contains both a quoted word and a comma, and neither is escaped. I expect to get the following result:

user_id artistname trackname playlistname
0 7511e45f2cc6f6e609ae46c15506538c Glenn Gould "Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino Instrumenal - Home Listens

But I'm receiving a parser error when the example line is reached: Error tokenizing data. C error: Expected 4 fields in line 14735, saw 5

That's how I'm reading the file:

import csv

import pandas as pd

df = pd.read_csv(
    'path_to/my_file.csv',
    sep=',',
    quoting=csv.QUOTE_ALL,
    quotechar='"',
    doublequote=False
)

Is there a way to read this file without preprocessing it? If not, what preprocessing should be done, noting that this is a huge file?

baileythegreen
  • 1
    You'll have to preprocess somehow... that's not valid CSV. To have a quote inside a quoted field, it needs to be escaped with another quote, so you really should have `"""Kyllikki""..."` (and you'd want `doublequote=True`). In fact, all of the options you're supplying are either the default or not what you want... you can literally just use `pd.read_csv(filename)` here – Jon Clements Apr 02 '22 at 19:44
  • 1
    I'd probably start off trying to ascertain how big the problem is... before doing your `pd.read_csv` create an empty list `lines_to_check = []` and then add `on_bad_lines=lines_to_check.append` to your `pd.read_csv(...)`... that'll get the valid rows in your DF and then you can look at `lines_to_check` to see what the damage is... – Jon Clements Apr 02 '22 at 19:48
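The preprocessing mentioned in the first comment could be done line by line, so the whole file never has to sit in memory. Below is a rough sketch using only the standard library, not a tested solution for the real file. It assumes that every field is quoted, that the three-character sequence `","` only ever appears between fields, and that no field contains a line break. The names `repair_line` and `repair_csv` are made up for this example.

```python
import csv
import io


def repair_line(line, n_fields=4):
    """Split one fully quoted line on the `","` boundary between fields."""
    body = line.rstrip("\r\n")
    if len(body) < 2 or not (body.startswith('"') and body.endswith('"')):
        raise ValueError(f"line is not fully quoted: {line!r}")
    # Drop the outer quotes, then split on the quote-comma-quote separator.
    fields = body[1:-1].split('","')
    if len(fields) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(fields)}: {line!r}")
    return fields


def repair_csv(src, dst, n_fields=4):
    """Stream lines from src and write properly escaped CSV rows to dst."""
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in src:
        writer.writerow(repair_line(line, n_fields))


# Demo on the two lines shown in the question.
sample = (
    '"user_id","artistname","trackname","playlistname"\n'
    '"7511e45f2cc6f6e609ae46c15506538c","Glenn Gould",'
    '""Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino",'
    '"Instrumenal - Home Listens"\n'
)
fixed = io.StringIO()
repair_csv(io.StringIO(sample), fixed)

# The repaired text now round-trips through a standard CSV reader.
rows = list(csv.reader(io.StringIO(fixed.getvalue())))
print(rows[1][2])
```

For the real file, you would pass two file objects opened with `newline=''` instead of the `io.StringIO` buffers, and then read the repaired file with a plain `pd.read_csv(...)`. Lines that break the assumptions raise `ValueError` instead of being silently mangled.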

1 Answer


I had the same issue, and changing the parsing engine solved the problem for me. Just add engine='python' to the pd.read_csv() call.

See https://stackoverflow.com/a/43586874/21271392

Michel