5

I have a .csv file containing data about movies, and I'm trying to reformat it as a JSON file to use in MongoDB. So I loaded the csv file into a pandas DataFrame and then used the to_json method to write it back out. Here is what one row of the DataFrame looks like:

In [43]: result.iloc[0]
Out[43]: 
title                                                      Avatar
release_date                                                 2009
cast            [{"cast_id": 242, "character": "Jake Sully", "...
crew            [{"credit_id": "52fe48009251416c750aca23", "de...
Name: 0, dtype: object

But when pandas writes it back, it comes out like this:

{   "title":"Avatar",
    "release_date":"2009",
    "cast":"[{\"cast_id\": 242, \"character\": \"Jake Sully\", \"credit_id\": \"5602a8a7c3a3685532001c9a\", \"gender\": 2,...]",
    "crew":"[{\"credit_id\": \"52fe48009251416c750aca23\", \"department\": \"Editing\", \"gender\": 0, \"id\": 1721,...]"
}

As you can see, 'cast' and 'crew' are lists, and they end up full of redundant backslashes. These backslashes show up in the MongoDB collections and make it impossible to extract data from these two fields.

How can I solve this problem other than replacing \" with "?

P.S.1: this is how I save the DataFrame as JSON:

result.to_json('result.json', orient='records', lines=True)

UPDATE 1: Apparently pandas is doing just fine and the problem comes from the original csv files. Here is what they look like:

movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3},...]"

I tried to replace "" with " (and I really wanted to avoid this hack):

sed -i 's/\"\"/\"/g'

And of course it caused problems in some lines of data when reading it as csv again:

ParserError: Error tokenizing data. C error: Expected 1501 fields in line 4, saw 1513

So we can conclude it's not safe to do such a blind replacement: the doubled quotes are how the CSV escapes quotes inside quoted fields, so collapsing all of them breaks the field quoting and the embedded commas get read as extra columns. Any ideas?

P.S.2: I'm using kaggle's 5000 movie dataset: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset

ali
  • 117
  • 1
  • 12
  • could you provide an extract of the original .csv file? – Heladio Amaya Feb 02 '20 at 19:49
  • @HeladioAmaya Oh, I didn't dig much into the original csv files. That's the problem. The data looks like: `""cast_id"": 242`. I think the problem is solved. Thank you for pointing it out. I will answer my shameful question as soon as I've made sure. – ali Feb 02 '20 at 20:28
  • @HeladioAmaya I couldn't fix it, so I added more info to the question. I would appreciate it if you took another look. Is there any more info needed? – ali Feb 02 '20 at 21:16

2 Answers

10

I had the same issue; the solution is in 3 steps (plus an optional 4th):

1- Load the DataFrame from csv, or in my case from xlsx:

excel_df = pd.read_excel(dataset, sheet_name=my_sheet_name)

2- Convert to JSON (use date_format='iso' if you have dates in your data):

json_str = excel_df.to_json(orient='records', date_format='iso')

3- The most important thing: json.loads. This is the key step:

parsed = json.loads(json_str)

4- (Optional) You can write or send the JSON file, for example writing it locally:

with open(out, 'w') as json_file:
    json_file.write(json.dumps({"data": parsed}, indent=4))

More info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
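Putting the steps together, a minimal end-to-end sketch (the file and sheet names below are placeholders, not from the question):

import json
import pandas as pd

# Placeholder file/sheet names -- adjust to your own dataset
excel_df = pd.read_excel('movies.xlsx', sheet_name='Sheet1')

# Serialize the frame, then re-parse it so you end up with real JSON objects
json_str = excel_df.to_json(orient='records', date_format='iso')
parsed = json.loads(json_str)

# Optional: write the parsed result back out
with open('result.json', 'w') as json_file:
    json_file.write(json.dumps({"data": parsed}, indent=4))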

Shalil Saleh
  • 101
  • 1
  • 2
  • You hero! I've been struggling with outputting JSON data via a Flask RESTful API. The json.loads(df.to_json()) did just the trick! – coross24 Nov 03 '21 at 18:43
5

Pandas escapes the " character because it treats the values in the JSON columns as plain text. To get the desired behaviour, simply parse the values in those columns as JSON before writing.

Let the file data.csv have the following content (with quotes escaped):

# data.csv
movie_id,title,cast
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3}]"

Read this into a DataFrame, then apply the json.loads function to the cast column and write it out to a file as JSON:

import json
import pandas as pd

df = pd.read_csv('data.csv')
df.cast = df.cast.apply(json.loads)
df.to_json('data.json', orient='records', lines=True)

The output is properly formatted JSON (extra newlines added by me):

# data.json
{"movie_id":19995,
 "title":"Avatar",
 "cast":[{"cast_id":242,"character":"Jake Sully","credit_id":"5602a8a7c3a3685532001c9a","gender":2,"id":65731,"name":"Sam Worthington","order":0},
         {"cast_id":3,"character":"Neytiri","credit_id":"52fe48009251416c750ac9cb","gender":1,"id":8691,"name":"Zoe Saldana","order":1},
         {"cast_id":25,"character":"Dr. Grace Augustine","credit_id":"52fe48009251416c750aca39","gender":1,"id":10205,"name":"Sigourney Weaver","order":2},
         {"cast_id":4,"character":"Col. Quaritch","credit_id":"52fe48009251416c750ac9cf","gender":2,"id":32747,"name":"Stephen Lang","order":3}]
}
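The question's dataset also has a crew column; the same parsing applies to every JSON-as-text column. A short sketch (column names taken from the question, file names are placeholders):

import json
import pandas as pd

df = pd.read_csv('data.csv')

# Parse every column that stores JSON as text -- 'cast' and 'crew' per the question
for col in ('cast', 'crew'):
    if col in df.columns:
        df[col] = df[col].apply(json.loads)

df.to_json('data.json', orient='records', lines=True)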
Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85