1

I have a csv as follows:

customer_id,date_time,product
1,"2018-10-08 00:00:00",[]
2,"2018-03-26 00:00:00","["apple","orange"]"

As one can see column 4 is messy in the sense that if that field has no text in then the square bracket would not be wrapped in double quotes.

Anywho, my problem is that when importing with pandas:

df = pd.read_csv('df.csv', sep=',')

I am presented with the error message:

ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

I am confident pandas is confused with the comma between "apple" and "orange" despite they reside in the same column. I have found: Python Pandas Error tokenizing data where the solution:

data = pd.read_csv('file1.csv', error_bad_lines=False)

is suggested. However, this is not viable in my case as this would affect too many rows. I am new to Python. In the past the following in R would have imported with no problem:

df <- read.csv(file.choose(), stringsAsFactors = FALSE)
halfer
  • 19,824
  • 17
  • 99
  • 186
Sweepy Dodo
  • 1,761
  • 9
  • 15
  • 1
    Do you have access to how the csv is created? I think this is just one of those problems that you can't fix with a magical read_csv argument because you irreversibly messed up the formatting the same way that you would if you tried to do `bad_str = "here is a "quote""`. – ALollz Jan 24 '20 at 17:47
  • It was from a SQL query. However, the query was not the cause. The data itself was [] and ["apple","orange"]. However, when exported into csv it came out like so – Sweepy Dodo Jan 24 '20 at 17:53
  • 1
    imho you're going to have to preprocess the cvs data first, use regex to convert `"[` & `]"` to just square brackets, before loading the data into a df. Or, if you have control over the query code, do it there. – deadvoid Jan 24 '20 at 17:55
  • 1
    Yes, So i'd advise re-doing that `.to_csv` call to perhaps use a different default character for quoting. I.e. `quotechar="'"` And then use the same when you read it back. But since you have `lists` really I'd think about using some file format that can serialize python objects (perhaps use `DataFrame.to_pickle()`). Otherwise you're going to have a headache with lists – ALollz Jan 24 '20 at 17:56
  • @deadvoid I used this method, however, Pandas was still confused with the comma in ["apple","orange"]. As such, the double quotes were not the cause. Still, appreciate the input – Sweepy Dodo Jan 24 '20 at 19:19
  • @T.Fung Do you have control over the export to CSV, though? – AMC Jan 24 '20 at 19:19
  • @AMC SQL query, yes. Export, no. The export was done via the GUI in MYSQL workbench – Sweepy Dodo Jan 24 '20 at 19:22

1 Answers1

2

Try reading the data using Python's csv module to help you parse the data. As an example:

import csv

import pandas as pd

with open('test.csv','r') as f:
    csv_reader = csv.reader(f, quotechar = None)
    headers = next(csv_reader)
    csv_list = list(csv_reader)

concatenated_csv_list = [i[:2]+[','.join(i[2:])] for i in csv_list]

This sorts out the weird commas as delimiters in your data file assuming the format is consistent where the additional commas only appear from the 2 indexed field onwards.

So once you have a consistent list:

concatenated_csv_list
>>
[['1', '"2018-10-08 00:00:00"', '[]'],
 ['2', '"2018-03-26 00:00:00"', '"["apple","orange"]"']]

You can transform it into a dataframe by reading the nested list and its headers.

pd.DataFrame(data = concatenated_csv_list, columns = headers)
BernardL
  • 5,162
  • 7
  • 28
  • 47
  • As it turns out we need another package. Solution works. I have been parsing your script. As said, I am new to Python. I get your intuition, however, this part [i[:2]+[','.join(i[2:])] for i in csv_list] will take me sometime to parse and understand it to its entirety. Thank you. – Sweepy Dodo Jan 24 '20 at 19:16
  • The `concatenated_csv_list` is basically a list comprehension that is taking the first values as is and then joining the 2nd indexed onward together. – BernardL Jan 25 '20 at 01:09
  • Spent sometime taking it apart I now know why it works. A very new way of coding to me. 1 question if I may. Why doesn't the following work? [for i in csv_list i[:2]+[','.join(i[2:])] ] – Sweepy Dodo Jan 25 '20 at 17:22
  • The method used here is called a list comprehension which follows a syntactic construct, your example is basically in the wrong syntax. Just google it to find out more. – BernardL Jan 31 '20 at 14:54