I am trying to read a CSV file using pandas. The file has a column called Tags, which consists of user-provided tags such as `,`, `""`, `''`, `1950's` and `16th-century`. Since these are user-provided, many special characters have been entered by mistake as well. The issue is that I cannot open the CSV file using pandas `read_csv`. It fails with the error `CParserError: Error tokenizing data`. Can someone help me with reading the CSV file into pandas?
-
To speed the process, can you post a few example lines from the file which are giving you trouble? – DSM Jan 27 '13 at 18:13
-
Is the tags field quoted? If not, you are going to have some difficulty. – Wes McKinney Jan 27 '13 at 18:22
-
The error is `pandas._parser.CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 8`. The 3rd column in the Tag field is a comma. The Tag fields are not quoted. Is there a workaround that does not require quoting the Tag column? – user1992696 Jan 27 '13 at 18:23
-
Urf. IIRC your columns are "Tag, User, Quality, Cluster_id", yes? Do the other three behave (no unquoted commas)? If so, then we can salvage it by looping over each line, taking the last three, and saying that everything else should go into the Tag field. – DSM Jan 27 '13 at 18:28
-
Yes, that is true, the columns are as you mentioned. The user is a URI, e.g. http://xyz.nl/user_001. Cluster_id just contains values from 1 to 500. Quality has values such as good, bad, usefulness-useful, usefulness-not_useful, etc. Only the Tags field contains cells that are just `,` and cells that hold several comma-separated words such as `17th,red,flower` in one cell. These cells cause the problem. – user1992696 Jan 27 '13 at 18:32
1 Answer
Okay. Starting from a badly formatted CSV we can't read:
>>> !cat unquoted.csv
1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
>>> pd.read_csv("unquoted.csv", header=None)
Traceback (most recent call last):
  File "<ipython-input-40-7d9aadb2fad5>", line 1, in <module>
    pd.read_csv("unquoted.csv", header=None)
  [...]
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17041)
CParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
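Before repairing anything, it can help to see which rows have the wrong number of fields. The following is a minimal sketch, not part of the original answer: `find_bad_rows` and `EXPECTED_FIELDS` are hypothetical names, and the expected count of 4 is taken from the column list given in the comments (Tag, User, Quality, Cluster_id). It uses only the standard `csv` module and reads from a string so it can be run without the file.

```python
import csv
import io

# Assumption: four columns (Tag, User, Quality, Cluster_id), as stated in the comments.
EXPECTED_FIELDS = 4


def find_bad_rows(text, expected=EXPECTED_FIELDS):
    """Return (line_number, field_count) for every row whose field count differs."""
    bad = []
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, len(row)))
    return bad


# Same three rows as unquoted.csv above
sample = (
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)

print(find_bad_rows(sample))  # [(2, 6)]
```

The reported row matches the parser error above: line 2 has 6 fields where 4 were expected.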
We can make a nicer version, taking advantage of the fact the last three columns are well-behaved:
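import csv

# Python 3: open in text mode with newline="" (the Python 2 original used "rb"/"wb")
with open("unquoted.csv", newline="") as infile, open("quoted.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # Rejoin everything except the last three fields into a single Tag field
        newline = [','.join(line[:-3])] + line[-3:]
        writer.writerow(newline)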
which produces
>>> !cat quoted.csv
1950's,xyz.nl/user_003,bad, 123
"17th,red,flower",xyz.nl/user_001,good,203
,xyz.nl/user_239,not very,345
and then we can read it:
>>> pd.read_csv("quoted.csv", header=None)
                 0                1         2    3
0           1950's  xyz.nl/user_003       bad  123
1  17th,red,flower  xyz.nl/user_001      good  203
2              NaN  xyz.nl/user_239  not very  345
I'd look into fixing this problem at the source and getting the data in a tolerable format, though. Tricks like this shouldn't be necessary, and the file could very easily have been impossible to repair.
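If writing an intermediate file is not wanted, the same repair can be done in memory. This is only a sketch under the same assumption as the answer (the last three columns never contain unquoted commas); `repair_rows` and `n_trailing` are hypothetical names introduced for illustration.

```python
import csv
import io


def repair_rows(text, n_trailing=3):
    """Merge all leading fields into one Tag field, keeping the last n_trailing fields."""
    repaired = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) <= n_trailing:
            # Not enough fields for a tag plus the trailing columns; do not guess
            raise ValueError("row has too few fields: %r" % (row,))
        repaired.append([",".join(row[:-n_trailing])] + row[-n_trailing:])
    return repaired


sample = (
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)

rows = repair_rows(sample)
for r in rows:
    print(r)
```

Every repaired row has exactly four fields, so the list could then be handed to something like `pd.DataFrame(rows, columns=["Tag", "User", "Quality", "Cluster_id"])`, with the column names taken from the comments on the question.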

DSM
-
Hi, Thank you for the solution. Could you explain what this particular line does? newline = [','.join(line[:-3])] + line[-3:] – user1992696 Jan 27 '13 at 19:41
-
`line[:-3]` is a list with all elements of the line except the last three. `','.join(some_sequence)` uses the string `","` -- a comma -- to combine them. This is because if you put `print(line)` inside the inner loop, you can see that the CSV reader didn't know not to split `17th,red,flower`, and so I have to recombine it into one term. The brackets `[]` make this a one-element list. The second term, `line[-3:]`, means "all the elements of the list starting three from the end". So really it's just "make a new list with the first element recombined from everything but the last three." – DSM Jan 27 '13 at 19:45
-
I tried the above code, but the outfile I get is the same as the infile (I do not get the quoted tags). In my infile, there are Tag fields containing just "," or "#", etc. Do you think that is causing the problem? – user1992696 Jan 27 '13 at 19:53
-
When I run the code, it does not give an error, but just reproduces the infile in the outfile. Some examples of Tags are [,], [*man], [12a44], [17thcentury, flower, red], [1920's], [19th century,painting], [3/4 angle], [age?]. These are mainly user-entered tags for an online painting collection. Some tags are just commas, and others contain a mix of special characters. – user1992696 Jan 27 '13 at 20:07
-
I just tried those as tags and it worked just fine. Stick `print(repr(line))` and `print(repr(newline))` in before the `writerow` call to see what it's doing. – DSM Jan 27 '13 at 20:11
-
I was trying to print them and this is what I get: line -> `['zen', 'http://steve.nl/user_4027', 'usefulness-useful', '500']`, newline -> `['zen', 'http://steve.nl/user_4027', 'usefulness-useful', '500']`. But if I change it to `newline = [','.join(line[:-2])] + line[-3:]`, then I get line -> `['zen', 'http://steve.nl/user_4027', 'usefulness-useful', '500']`, newline -> `['zen,http://steve.nl/user_4027', 'http://steve.nl/user_4027', 'usefulness-useful', '500']`. – user1992696 Jan 27 '13 at 20:17
-
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/23457/discussion-between-dsm-and-user1992696) – DSM Jan 27 '13 at 20:21
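The slicing expression discussed in the comments above can be checked in isolation. The short sketch below uses rows quoted in this thread; the variable names `head`, `tail`, `clean` and `unchanged` are only illustrative. It also shows that a row with exactly four fields passes through unchanged, which is the expected behaviour for rows whose tag contains no comma.

```python
# A row that the CSV reader split into six fields because the tag contains commas
line = ["17th", "red", "flower", "xyz.nl/user_001", "good", "203"]

head = line[:-3]  # everything except the last three: ['17th', 'red', 'flower']
tail = line[-3:]  # the last three fields: ['xyz.nl/user_001', 'good', '203']

# Recombine the head into one string and wrap it in a one-element list
newline = [",".join(head)] + tail
print(newline)  # ['17th,red,flower', 'xyz.nl/user_001', 'good', '203']

# A row that already has four fields is left as it was
clean = ["zen", "http://steve.nl/user_4027", "usefulness-useful", "500"]
unchanged = [",".join(clean[:-3])] + clean[-3:]
print(unchanged == clean)  # True
```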