I've seen different variations of this question, but none seemed to be exactly what I need.
I have a CSV file with 4 columns, with double quotes enclosing every field, as in the sample below:
"user_id","artistname","trackname","playlistname"
"7511e45f2cc6f6e609ae46c15506538c","Glenn Gould",""Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino","Instrumenal - Home Listens"
As the example above shows, the trackname field contains both a quoted word and a comma character,
neither of them escaped. I expect the following result:
| | user_id | artistname | trackname | playlistname |
|---|---|---|---|---|
| 0 | 7511e45f2cc6f6e609ae46c15506538c | Glenn Gould | "Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino | Instrumenal - Home Listens |
But I'm receiving a parser error when the example line is reached: Error tokenizing data. C error: Expected 4 fields in line 14735, saw 5
This is how I'm reading the file:
import csv
import pandas as pd

df = pd.read_csv(
    'path_to/my_file.csv',
    sep=',',
    quoting=csv.QUOTE_ALL,
    quotechar='"',
    doublequote=False
)
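A rough way to see where the extra field comes from is to feed the problematic line to Python's built-in csv module with the same options. This is only a sketch: it assumes the standard-library reader tokenizes this line in much the same way as the pandas C parser, which may not hold in every detail.

```python
import csv
import io

# The problematic data line from the sample, verbatim.
line = (
    '"7511e45f2cc6f6e609ae46c15506538c","Glenn Gould",'
    '""Kyllikki" - Three Lyric Pieces for Piano, Op. 41 - II. Andantino",'
    '"Instrumenal - Home Listens"'
)

# Same dialect options as the read_csv call (sep is called delimiter here).
reader = csv.reader(
    io.StringIO(line),
    delimiter=',',
    quotechar='"',
    quoting=csv.QUOTE_ALL,
    doublequote=False,
)

fields = next(reader)

# Print how many fields the reader produced, then each field on its own line.
print(len(fields))
for field in fields:
    print(repr(field))
```

With doublequote=False, the second quote at the start of the trackname field closes the quoted section right away, so the comma after "Piano" is then treated as an ordinary field separator.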
Is there a way to read this file without preprocessing it? If not, what preprocessing should be done, given that this is a huge file?
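One preprocessing idea, sketched below under strong assumptions: every field on every line is wrapped in double quotes, no field contains a line break, and the three-character sequence `","` never occurs inside a field value. The names `FIELD_SEP`, `split_quoted_line` and `iter_rows` are hypothetical helpers, not part of pandas or the csv module. The file is read one line at a time, so it does not have to fit in memory at once.

```python
from typing import Iterator, List

# Assumed field boundary: closing quote, comma, opening quote.
# This only works if that sequence never appears inside a field value.
FIELD_SEP = '","'


def split_quoted_line(line: str) -> List[str]:
    """Split one fully quoted CSV line on the `","` boundary."""
    stripped = line.rstrip("\r\n")
    # Every line is assumed to start and end with a double quote.
    if len(stripped) < 2 or stripped[0] != '"' or stripped[-1] != '"':
        raise ValueError(f"line is not fully quoted: {stripped!r}")
    # Drop the outermost quotes, then split on the boundary sequence.
    return stripped[1:-1].split(FIELD_SEP)


def iter_rows(path: str, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield the fields of each line lazily, one line at a time."""
    with open(path, encoding=encoding, newline="") as handle:
        for line in handle:
            yield split_quoted_line(line)
```

The rows produced this way could then be collected in batches and turned into DataFrames, with the first row used as the column names. Lines that break the assumptions raise a ValueError instead of being split silently.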