
I'm reading a basic csv file where the columns are separated by commas with these column names:

userid, username, body

However, the body column is a string that may contain commas. Obviously this causes a problem, and pandas throws an error:

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8

Is there a way to tell pandas to ignore commas in a specific column or a way to go around this problem?

David

5 Answers


Imagine we're reading your file, called comma.csv:

userid, username, body
01, n1, 'string1, string2'

One thing you can do is to specify the quote character used for the strings in that column:

df = pd.read_csv('comma.csv', quotechar="'")

In this case, strings delimited by ' are treated as whole fields, regardless of any commas inside them.
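As a self-contained sketch of this (an in-memory buffer stands in for comma.csv, and `skipinitialspace=True` is my addition, needed because the sample data has a space between each delimiter and the opening quote):

```python
import io
import pandas as pd

# In-memory stand-in for comma.csv from the answer above.
csv_text = "userid, username, body\n01, n1, 'string1, string2'\n"

# skipinitialspace strips the blank after each comma, so the single
# quote is seen at the start of the field and quoting kicks in.
df = pd.read_csv(io.StringIO(csv_text), quotechar="'", skipinitialspace=True)

print(df["body"].iloc[0])  # the quoted field survives as a single value
```

As the comments below point out, this only helps when the body field actually is quoted in the raw data.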

Fabio Lamanna
  • I don't think his string has quotes, because if it did then pandas would recognize it without `quotechar` – Leb Sep 23 '15 at 15:44
  • Like Leb said, I don't have quotes around the body – David Sep 23 '15 at 15:45
  • @David Ok, got it. Was your dataframe provided "as is", or can you do some preprocessing on it? – Fabio Lamanna Sep 23 '15 at 15:48
  • @Fabio: the dataframe is the output of a Hive SQL query. I then replace the default tab delimiter with ",". I tried to avoid replacing the tab delimiter and keep it as is, trying `pd.read_csv(..., sep='\t')`, but it turns out that some `body` fields have tabs in them as well, so it's all just a mess. – David Sep 23 '15 at 15:52
  • @David have a look at [this](http://stackoverflow.com/q/14550441/2699288) question, hope that can help you. – Fabio Lamanna Sep 23 '15 at 15:58
  • @David Why not import straight from the SQL, maybe something like this http://stackoverflow.com/questions/12047193/how-to-convert-sql-query-result-to-pandas-data-structure – Leb Sep 23 '15 at 15:59
  • Thank you both for your feedback. I figured it out with @Fabio's suggested question. – David Sep 24 '15 at 13:17
  • I tried this example, but it does not work for me. Python 3, pandas 0.18.0 reads `comma.csv`, but `01` becomes the index `1`, and `username` becomes only `'string1` :( – ragesz May 06 '16 at 13:50
  • Does NOT work in pandas 1.5.3. It parses the first column as the index and the comma-separated quoted string as two columns. – Karel Marik Apr 24 '23 at 17:30

Add `usecols` and `lineterminator` to your `read_csv()` call, where `n` is the number of columns you expect.

In my case:

n = 5 #define yours
df = pd.read_csv(file,
                 usecols=range(n),
                 lineterminator='\n',
                 header=None)
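A self-contained sketch of this with toy data of my own; `usecols=range(n)` keeps just the first `n` columns, and with `header=None` the table is sized from the first line, with shorter lines padded with NaN (whether extra fields on *later* lines still raise a tokenizing error can depend on the pandas version):

```python
import io
import pandas as pd

# Toy stand-in for a file whose lines have varying numbers of fields.
csv_text = "1, n1, hello, world\n2, n2, plain\n"

n = 3  # define yours
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=range(n),
                 lineterminator='\n',
                 header=None)

print(df.shape)  # two rows, n columns
```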
Ilyas

For me, none of the above code samples worked (I was working on the Netflix Prize dataset on Kaggle), but there is actually one cool feature in pandas 1.3.0+: an `on_bad_lines` parameter that lets you use a callback function (passing a callable requires pandas 1.4.0+). Here is what I did:

def manual_separation(bad_line):
    # all the "bad lines" came from the same last column, which contained ","
    right_split = bad_line[:-2] + [",".join(bad_line[-2:])]
    return right_split

filename = "netflix_movie_titles.csv"
df = pd.read_csv(
        filename, 
        header=None,
        encoding="ISO-8859-1",
        names = ['Movie_Id', 'Year', 'Name'], 
        on_bad_lines=manual_separation,
        engine="python",
    )

Works like a charm! The only requirement is to use `engine="python"`. Hope that helps!
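A miniature, self-contained version of the same idea (toy movie rows of my own instead of the Netflix file; needs pandas 1.4+ for the callable):

```python
import io
import pandas as pd

# The second line has an extra comma inside the title.
csv_text = "1,2003,Dinosaur Planet\n2,2004,Isle of Man TT, 2004 Review\n"

def manual_separation(bad_line):
    # Rejoin everything past the second field back into the last column.
    return bad_line[:2] + [",".join(bad_line[2:])]

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["Movie_Id", "Year", "Name"],
    on_bad_lines=manual_separation,
    engine="python",
)
```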

Antoine Krajnc

Does this help?

import csv
with open("csv_with_commas.csv", newline='', encoding = 'utf8') as f:
    csvread = csv.reader(f)
    batch_data = list(csvread)
    print(batch_data)
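The `csv` module happily returns rows of varying length, so one way to continue — my own sketch, not part of the original answer — is to fold any surplus fields back into the last column before building a DataFrame:

```python
import csv
import io
import pandas as pd

# In-memory stand-in for csv_with_commas.csv.
csv_text = "userid,username,body\n1,n1,hello, world\n2,n2,plain\n"

rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

# Fold any extra fields back into the last column.
fixed = [r[:2] + [",".join(r[2:])] for r in data]
df = pd.DataFrame(fixed, columns=header)
```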

Reference:

[1] https://stackoverflow.com/a/40477760/6907424

[2] To combat "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 157: character maps to undefined": https://stackoverflow.com/a/9233174/6907424

hafiz031

First, I didn't find anything that resolves the "comma inside quotes" issue systematically and properly. pandas 1.5.3 is unable to parse it correctly; I tried specifying parameters like quotechar, quoting, escapechar, lineterminator, ...

Finally, I found two workarounds that take advantage of the fact that I know the comma can occur in the last column only. Assume the following CSV content:

userid, username, body
1, Joe, string1
2, Jim, "string21, string22"

If you don't mind that the part after the 3rd comma is lost, then specify the number of columns:

pd.read_csv(r'c:\TEMP\to_parse.csv',usecols=range(3))

which yields

   userid  username        body
0       1       Joe     string1
1       2       Jim   "string21

The second workaround is more complicated, but it yields the complete string, comma included. The principle is to replace the first 2 commas of each line with semicolons (you must know the number of columns):

import io
import pandas as pd

with open(path, 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    # demote any real semicolons to colons, then turn the first
    # two commas of each line into the new separator
    fo.writelines(line.replace(';', ':').replace(',', ';', 2) for line in data)
    fo.seek(0)
df = pd.read_csv(fo, on_bad_lines='warn', sep=';')

Perhaps it could be accomplished by regex as well.
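For instance, a regex sketch of the same idea (my own variant, using `re.sub` with `count=2` on in-memory data; `skipinitialspace=True` is also my addition, so the quote after each separator is recognized):

```python
import io
import re
import pandas as pd

csv_text = ('userid, username, body\n'
            '1, Joe, string1\n'
            '2, Jim, "string21, string22"\n')

# Replace only the first two commas of each line with semicolons,
# leaving any later commas inside the body column untouched.
fixed = "\n".join(re.sub(",", ";", line, count=2)
                  for line in csv_text.splitlines())

df = pd.read_csv(io.StringIO(fixed), sep=";", skipinitialspace=True)
```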

Karel Marik
  • Hi @karel-marik, you might want to try adding the `skipinitialspace=True` option to `read_csv` (see [here](https://stackoverflow.com/a/8311951/10201982)). This helped me to get an example like yours to work with pandas. – astoeriko Jul 12 '23 at 17:11