Problems reading CSV file with commas and characters in pandas

Question

I am trying to read a CSV file using pandas. The file has a column called Tags, which consists of user-provided tags and has tags like - , "", '', 1950's, 16th-century. Since these are user-provided, many special characters have been entered by mistake as well. The issue is that I cannot open the file with pandas read_csv; it fails with the error: CParserError: Error tokenizing data. Can someone help me with reading the CSV file into pandas?

To speed the process, can you post a few example lines from the file which are giving you trouble? – DSM Jan 27 '13 at 18:13
Is the tags field quoted? If not, you are going to have some difficulty. – Wes McKinney Jan 27 '13 at 18:22
`pandas._parser.CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 8`. The 3rd column in the Tag field is a comma. The tag fields are not quoted. Is there a workaround without quoting the Tag column? – user1992696 Jan 27 '13 at 18:23
Urf. IIRC your columns are "Tag, User, Quality, Cluster_id", yes? Do the other three behave (no unquoted commas)? If so, then we can salvage it by looping over each line, taking the last three, and saying that everything else should go into the Tag field. – DSM Jan 27 '13 at 18:28
Yes, that is true, the columns are as you mentioned. The user is a URI, e.g. http://xyz.nl/user_001. Cluster_id just contains values from 1 to 500. Quality has values such as good, bad, usefulness-useful, usefulness-not_useful, etc. Only the Tags field contains cells that are just a , and cells that hold several words like 17th,red,flower in one cell. These cells cause the problem. – user1992696 Jan 27 '13 at 18:32

score 9 · Accepted Answer · answered Jan 27 '13 at 18:49

Okay. Starting from a badly formatted CSV we can't read:

>>> !cat unquoted.csv
1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
>>> pd.read_csv("unquoted.csv", header=None)
Traceback (most recent call last):
  File "<ipython-input-40-7d9aadb2fad5>", line 1, in <module>
    pd.read_csv("unquoted.csv", header=None)
[...]
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17041)
CParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
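
Before repairing anything, it can help to see which lines have an unexpected number of fields. The following is a minimal sketch and not part of the original answer: the helper name `find_bad_lines` is hypothetical, and the expected count of 4 is an assumption taken from the four columns named in the comments (Tag, User, Quality, Cluster_id). The sample lines are the three from unquoted.csv, held in memory.

```python
import csv
import io

def find_bad_lines(lines, expected=4):
    """Return (line_number, field_count) for rows whose field count differs from `expected`."""
    bad = []
    for number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected:
            bad.append((number, len(row)))
    return bad

# The three sample lines from unquoted.csv, kept in memory for illustration.
sample = io.StringIO(
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)
bad_lines = find_bad_lines(sample)
print(bad_lines)  # [(2, 6)]
```

Line 2 is reported with 6 fields, which agrees with the "Expected 4 fields in line 2, saw 6" message in the traceback.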

We can make a nicer version, taking advantage of the fact that the last three columns are well-behaved:

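import csv

# newline="" lets the csv module handle line endings itself (Python 3;
# on Python 2, open the files with "rb" and "wb" instead).
with open("unquoted.csv", newline="") as infile, open("quoted.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # Everything except the last three fields belongs to the tag.
        newline = [','.join(line[:-3])] + line[-3:]
        writer.writerow(newline)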

which produces

>>> !cat quoted.csv
1950's,xyz.nl/user_003,bad, 123
"17th,red,flower",xyz.nl/user_001,good,203
,xyz.nl/user_239,not very,345

and then we can read it:

>>> pd.read_csv("quoted.csv", header=None)
                 0                1         2    3
0           1950's  xyz.nl/user_003       bad  123
1  17th,red,flower  xyz.nl/user_001      good  203
2              NaN  xyz.nl/user_239  not very  345
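
If the intermediate quoted.csv file is not wanted, the same idea can probably be applied in memory by splitting each line from the right. This is only a sketch under the same assumption as above (the last three columns never contain a comma); the helper name `split_tag_rows` is hypothetical. Unlike the csv module, `str.rsplit` knows nothing about quoting, so a tag written as `""` stays as those two literal characters.

```python
def split_tag_rows(lines, n_trailing=3):
    """Split each line from the right so that commas inside the tag survive."""
    rows = []
    for raw in lines:
        raw = raw.rstrip("\r\n")
        if not raw:
            continue  # skip blank lines
        # rsplit with maxsplit=n_trailing yields at most n_trailing + 1 pieces:
        # the tag (commas intact) followed by the last three fields.
        rows.append(raw.rsplit(",", n_trailing))
    return rows

sample_lines = [
    "1950's,xyz.nl/user_003,bad, 123\n",
    "17th,red,flower,xyz.nl/user_001,good,203\n",
]
rows = split_tag_rows(sample_lines)
print(rows[1])  # ['17th,red,flower', 'xyz.nl/user_001', 'good', '203']
```

The resulting list of rows can then be handed to `pd.DataFrame(rows, columns=["Tag", "User", "Quality", "Cluster_id"])`, with the column names taken from the comments under the question.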

I'd look into fixing this problem at the source and getting the data in a tolerable format, though. Tricks like this shouldn't be necessary, and the file could easily have been impossible to repair (for example, if a second column had also contained unquoted commas).
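
As a rough illustration of what fixing it at the source could look like: if whatever exports the data writes it through Python's csv module, fields containing commas are quoted automatically. The records below are hypothetical stand-ins for the exporter's data, not taken from the asker's file.

```python
import csv
import io

# Hypothetical records as the exporting side might hold them:
# (tag, user, quality, cluster_id).
records = [
    ("17th,red,flower", "xyz.nl/user_001", "good", 203),
    (",", "xyz.nl/user_239", "not very", 345),
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default QUOTE_MINIMAL quotes only fields that need it
writer.writerows(records)
exported = buffer.getvalue()
print(exported)

# Reading the text back recovers the tags unchanged (numbers come back as strings).
round_trip = list(csv.reader(io.StringIO(exported)))
print(round_trip[0][0])  # 17th,red,flower
```

A file written this way has exactly four fields per line as far as any CSV parser is concerned, so `pd.read_csv` should need no repair step.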

answered Jan 27 '13 at 18:49

DSM

Hi, thank you for the solution. Could you explain what this particular line does? `newline = [','.join(line[:-3])] + line[-3:]` – user1992696 Jan 27 '13 at 19:41
`line[:-3]` is a list with all elements of the line except the last three. `','.join(some_sequence)` uses the string `","` -- a comma -- to combine them. This is necessary because, if you put `print(line)` inside the loop, you can see that the CSV reader didn't know not to split `17th,red,flower`, and so I have to recombine it into one term. The brackets `[]` make this a one-element list. The second term, `line[-3:]`, means 'all the elements of the list starting three from the end'. So really it's just "make a new list with the first element recombined from everything but the last three." – DSM Jan 27 '13 at 19:45
I tried the above code, but for me the outfile comes out the same as the infile (I do not get the quoted tags). In my infile, there are Tag fields with just "," "#" etc. Do you think that is causing the problem? – user1992696 Jan 27 '13 at 19:53
I really need to see some examples of the troublesome cases to say more. – DSM Jan 27 '13 at 20:01
When I run the code, it does not give an error, but just reproduces the infile in the outfile. Some examples of Tags are [,], [*man], [12a44], [17thcentury, flower, red], [1920's], [19th century,painting], [3/4 angle], [age?]. These are mainly user-entered tags for an online painting collection. Some tags are just commas, and some contain a mix of special characters. – user1992696 Jan 27 '13 at 20:07
I just tried those as tags and it worked just fine. Stick `print(repr(line))` and `print(repr(newline))` in before the `writerow` call to see what it's doing. – DSM Jan 27 '13 at 20:11
I tried printing them, and this is what I get: line -> `['zen', 'http://steve.nl/user_4027', 'usefulness-useful', '500']`, newline -> `['zen', 'http://steve.nl/user_4027', 'usefulness-useful', '500']`. But if I change the line to `newline = [','.join(line[:-2])] + line[-3:]`, then I get line -> `['zen', 'http://steve.nl/user_4027', 'usefulness-useful', '500']`, newline -> `['zen,http://steve.nl/user_4027', 'http://steve.nl/user_4027', 'usefulness-useful', '500']` – user1992696 Jan 27 '13 at 20:17
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/23457/discussion-between-dsm-and-user1992696) – DSM Jan 27 '13 at 20:21
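
The expression discussed in the comments above can be tried on its own. This is a small sketch; the wrapper name `rebuild` is hypothetical, and the two input lists are the split version of the answer's `17th,red,flower` line and the `zen` line quoted in the comments.

```python
def rebuild(line):
    # Same expression as in the answer: join everything but the last three
    # fields back into one tag, then append the last three unchanged.
    return [','.join(line[:-3])] + line[-3:]

split_tag = ['17th', 'red', 'flower', 'xyz.nl/user_001', 'good', '203']
already_ok = ['zen', 'http://steve.nl/user_4027', 'usefulness-useful', '500']

print(split_tag[:-3])      # ['17th', 'red', 'flower']
print(rebuild(split_tag))  # ['17th,red,flower', 'xyz.nl/user_001', 'good', '203']
print(rebuild(already_ok) == already_ok)  # True
```

A line that already has exactly four fields passes through unchanged, so identical `line` and `newline` output for a tag like `zen` is the expected behaviour; only lines where the reader split the tag into several pieces are altered.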
