8

I am trying to read this dataset from Kaggle: Amazon sales rank data for print and kindle books

The file amazon_com_extras.csv has a column named "Title" that sometimes contains a comma (','), so all the fields in this .csv are enclosed in quotation marks:

"ASIN","GROUP","FORMAT","TITLE","AUTHOR","PUBLISHER"
"022640014X","book","hardcover","The Diversity Bargain: And Other Dilemmas of Race, Admissions, and Meritocracy at Elite Universities","Natasha K. Warikoo","University Of Chicago Press"

I have read other questions related to this problem but none of them solve it. For example, I have tried:

import pandas as pd

df = pd.read_csv("amazon_com_extras.csv", engine="python", sep=',')
df = pd.read_csv("amazon_com_extras.csv", engine="python", sep=',', quotechar='"')

But nothing seems to work. I am using Python 3.7.2 and pandas 0.24.1.

davidaap
  • Tried loading the same file. The issue is not the commas, but a line where there are extra `"` characters in the title field (e.g. line 808, which contains: "Girl in Glass: How My "Distressed Baby" Defied the Odds). These should have been escaped as "". – John Sloper Mar 05 '19 at 20:44
  • I fixed it using `doublequote=False`. – vaugusto Jul 09 '23 at 03:11

3 Answers

14

This is happening to you because there are fields inside the document that contain unescaped quotes inside the quoted text.

I am not aware of a way to instruct the csv parser to handle that without preprocessing.

If you don't care about those rows, you can use

pd.read_csv("amazon_com_extras.csv", engine="python", sep=',', quotechar='"', error_bad_lines=False)

That will prevent the exception from being raised, but it will drop the affected lines (you will see a warning for each skipped line in the console).

An example of such a line:

"1405246510","book","hardcover",""Hannah Montana" Annual 2010","Unknown","Egmont Books Ltd"

Notice the unescaped inner quotes around "Hannah Montana".

Instead, a more standard dialect of csv would have rendered:

1405246510,"book","hardcover","""Hannah Montana"" Annual 2010","Unknown","Egmont Books Ltd"

You can, for example, load the file with LibreOffice and re-save it as CSV to get a working CSV dialect, or use other preprocessing techniques, such as the sketch below.
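A rough preprocessing sketch along those lines (the cleaned-file name is my own, and it assumes every field in this file is fully quoted and that no field contains the literal sequence ","):

import pandas as pd

def clean_line(line):
    # Strip the outer quotes, split on the '","' separators,
    # then escape any stray inner quotes by doubling them.
    if not line.strip():
        return line
    fields = line.rstrip("\n")[1:-1].split('","')
    fields = [f.replace('"', '""') for f in fields]
    return '"' + '","'.join(fields) + '"\n'

with open("amazon_com_extras.csv") as src, \
     open("amazon_com_extras_clean.csv", "w") as dst:
    for line in src:
        dst.write(clean_line(line))

df = pd.read_csv("amazon_com_extras_clean.csv")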

ssice
  • Thanks, now I have an issue when I try to process this file in chunks: https://stackoverflow.com/questions/55011726/error-bad-lines-false-does-not-work-when-using-chunksize If you could help me that would be great. – davidaap Mar 05 '19 at 21:18
  • To add on top of the accepted answer and expand the question a bit: sometimes we saw this error because the separator is not ',', especially if you saved the file in CSV format yourself. Make sure you double-check that before proceeding. If that is the case, trying to save it in .xls format may solve the problem as well. – CathyQian Apr 23 '19 at 18:30
8

The problem is that pandas treats the " character as a quoting character and expects every " inside a quoted cell to be escaped by doubling it, which doesn't happen in this csv.

To make pandas stop treating it as a quoting character, pass the parameter quoting=3 to the pd.read_csv function, as in the sketch below.
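For reference, quoting=3 is just the value of csv.QUOTE_NONE, so you can pass the named constant instead of the magic number. A minimal sketch (see the comments below for the caveat about what disabling quoting does to fields that contain commas):

import csv
import pandas as pd

# quoting=csv.QUOTE_NONE (== 3) makes pandas treat '"' as an ordinary character.
df = pd.read_csv("amazon_com_extras.csv", quoting=csv.QUOTE_NONE)
# Caveat: the surrounding quotes now stay in every value, and commas inside
# quoted titles are treated as field separators.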

Aryerez
  • This will make the error go away but can lead to broken results when parsing the csv. – Taher A. Ghaleb Mar 15 '21 at 18:20
  • @TaherAhmedGhaleb Are you sure you meant to write the comment to me? Because my solution is not ignoring the error, but treating the " char as a regular char, in a csv file that appears to use it just as one. – Aryerez Mar 16 '21 at 20:00
  • I didn't say you ignore the error. It works but messes up your data. – Taher A. Ghaleb Mar 16 '21 at 20:35
  • @TaherAhmedGhaleb I didn't mess up my data. If your data got messed up, it was probably messed up from the beginning. – Aryerez Mar 19 '21 at 09:48
2

This works for me, using csv.Sniffer to detect the dialect:

import csv
import pandas as pd

# Let csv.Sniffer guess the dialect (delimiter, quote character, etc.)
# from a sample of the file.
with open('spotify_dataset.csv') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(14734))

df = pd.read_csv('spotify_dataset.csv', engine='python', dialect=dialect, error_bad_lines=False)
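Continuing from the snippet above, you can also inspect what the sniffer actually detected before trusting it (a quick check added for illustration):

print(dialect.delimiter, dialect.quotechar, dialect.doublequote)  # detected dialect settings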
alpoza