Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

Question

Similar to this question, but my CSV has a slightly different format. Here is an example:

id,employee,details,createdAt  
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"  
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"

I think the double quotation mark in the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:

id  employee                details    createdAt              Unnamed: 1  Unnamed: 2 
 1      John        {Country":"USA"  Salary:5000           Review:null}"  2018-09-01 
 2     Sarah  {Country":"Australia"  Salary:6000  Review:"Hardworking"}"  2018-09-05

My desired output:

id  employee                                                       details   createdAt
 1      John                 {"Country":"USA","Salary":5000,"Review":null}  2018-09-01 
 2     Sarah  {"Country":"Australia","Salary":6000,"Review":"Hardworking"}  2018-09-05

I've tried adding quotechar='"' as the parameter and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the json value?

the problem is not with quote, it is with comma, while reading csv all the entries separated by comma are considered as next column — Gahan, Sep 08 '18 at 07:36
@Gahan single columns in a CSV can contain commas. The issue probably is the enclosing `"` on the string, causing the commas to be interpreted as new columns rather than part of a dictionary structure — roganjosh, Sep 08 '18 at 07:38
@roganjosh , I tried, the structure is too responsible for it as quote enclosed `"{"` and then `Country` without quote and then `":"` in quote and then `USA"` and comma encountered which interpreted it as next column value — Gahan, Sep 08 '18 at 07:41
I suspect it can only be solved with regex, which rules me out of helping sorry :/ — roganjosh, Sep 08 '18 at 07:42
I completely agree with that, either it needs to have supported format of input or must be parsed with regex — Gahan, Sep 08 '18 at 07:43
Instead of trying to parse this, you should rather not use a mix of two badly-interacting metaformats (CSV, JSON) to write the data in the first place. Just use JSON all the way as a default. If you must use this, you need to escape quotes. — Ulrich Eckhardt, Sep 08 '18 at 07:48

Richard Rublev · Answer 1 · 2018-09-08T07:47:37.213

0

I have reproduced your file With

   df = pd.read_csv('e1.csv', index_col=None )

print (df)

Output

     id    emp                                            details      createdat
0   1   john    "{"Country":"USA","Salary":5000,"Review":null}"  "2018-09-01" 
1   2  sarah  "{"Country":"Australia", "Salary":6000,"Review...   "2018-09-05"

edited Sep 08 '18 at 07:47

answered Sep 08 '18 at 07:43

Richard Rublev

7,718
16
77
121

2

could you specify what more tricks you did, tried your solution and it throws exception: `ParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6` – Gahan Sep 08 '18 at 07:45
Hi, I tried using `header=None` but it still gives me the same result. – hotchocolate Sep 08 '18 at 07:48
`index_col` is `None` by default so your above solution is equivalent to what the OP has already tried. – Jarad Sep 08 '18 at 09:45

score 0 · Answer 2 · answered Sep 08 '18 at 10:07

I think there's a better way by passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out totally.

Here is a less than optimal option:

df = pd.read_csv('s083838383.csv', sep='@#$%^', engine='python')
header = df.columns[0]
print(df)

Why sep='@#$%^' ? This is just garbage that allows you to read the file with no sep character. It could be any random character and is just used as a means to import the data into a df object to work with.

df looks like this:

                       id,employee,details,createdAt
0  1,John,"{"Country":"USA","Salary":5000,"Review...
1  2,Sarah,"{"Country":"Australia", "Salary":6000...

Then you could use str.extract to apply regex and expand the columns:

result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).applymap(str.strip)

result.columns = header.strip().split(',')
print(result)

result is:

  id employee                                            details     createdAt
0  1     John    "{"Country":"USA","Salary":5000,"Review":null}"  "2018-09-01"
1  2    Sarah  "{"Country":"Australia", "Salary":6000,"Review...  "2018-09-05"

If you need the starting and ending quotes stripped off of the details string values, you could do:

result['details'] = result['details'].str.strip('"')

If the details object items needs to be a dicts instead of strings, you could do:

from json import loads
result['details'] = result['details'].apply(loads)

score 0 · Accepted Answer · answered Sep 08 '18 at 11:13

As an alternative approach you could read the file in manually, parse each row correctly and use the resulting data to contruct the dataframe. This works by splitting the row both forward and backwards to get the non-problematic columns and then taking the remaining part:

import pandas as pd

data = []

with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        split = row.split(',', 2)
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)

df = pd.DataFrame(data[1:], columns=data[0])
print(df)

This would display your data as:

  id employee                                            details   createdAt
0  1     John      {"Country":"USA","Salary":5000,"Review":null}  2018-09-01
1  2    Sarah  {"Country":"Australia", "Salary":6000,"Review"...  2018-09-05

Yes I'm doing something similar to that: read the file as raw text, then do modification so that the JSON format is readable by pandas. Thanks! — hotchocolate, Sep 08 '18 at 12:28

Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

3 Answers3