-1

I am trying to import JSON data including nested data, into a Pandas DataFrame. It is illustrated here using a nice little public dataset that I found here. Let's focus our attention on the header 'owner' for the purpose of this illustration. Note that I'm explicitly casting the data to str. If you print the response data from response.content it has the proper JSON syntax with double quotes.

However, pd.read_json() seems to convert the JSON string under 'owner' into single quotes. Am I doing something wrong here or should this be raised as a dev issue in read_json()? I see a different issue relating to the single/double quotes has been fixed in the past on pandas-dev.

>>> import pandas as pd
>>> import requests
>>> response = requests.get(url='https://api.github.com/users/mralexgray/repos')    
>>> df = pd.read_json(response.content, orient='records', dtype=str)
>>> df['owner'].iloc[1, ]
"{'login': 'mralexgray', 'id': 262517, 'node_id': 'MDQ6VXNlcjI2MjUxNw==', 'avatar_url': 'https://avatars.githubusercontent.com/u/262517?v=4',..."
>>> type(df['owner'].iloc[1, ])
str
>>> response.content
    b'[{"id":6104546,"node_id":"MDEwOlJlcG9zaXRvcnk2MTA0NTQ2","name":"-REPONAME","full_name":"mralexgray/-REPONAME","private":false,"owner":{"login":"mralexgray","id":262517,"node_id":"MDQ6VXNlcjI2MjUxNw==","avatar_url":"https://avatars.githubusercontent.com/u/262517?v=4"...

The only observation I make here is that pd.read_json() may be importing the nested JSON data as a Python dict before it is cast to str.

I'm running Python 3.8.10 with Pandas 1.2.4.

Hawklaz
  • 306
  • 4
  • 20
  • I don't understand the question. What output were you expecting instead, exactly? – Karl Knechtel Jul 12 '21 at 08:47
  • 3
    It seems like you were perhaps expecting to see e.g. `{"foo": "bar"}` instead of `{'foo': 'bar'}`. Here's a hint: what happens when you try just `{"foo": "bar"}` by itself at the Python interpreter prompt? Another hint: what happens when you try `{"foo": "bar"} == {'foo': 'bar'}`? Or `"foo" == 'foo'`? – Karl Knechtel Jul 12 '21 at 08:49
  • 1
    Yes, when the variable is a `dict` as in the case of `{"foo": "bar"}` and `{'foo': 'bar'}` they are equal. And the same goes for the individual strings `'foo'` and `"foo"`. However, my issue is with the content in the JSON string. `"{'foo': 'bar'}"` is not the same as `"{"foo": "bar"}"`. This becomes an issue for me later when trying to later use [`json_normalize()`](https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.json_normalize.html) where I get `json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes:` – Hawklaz Jul 12 '21 at 09:51
  • 1
    Yes, those are not equal strings. But right now you *don't have a string* in your Dataframe; you have a dict that represents the parsed JSON. If this somehow causes a problem "using `json_normalize()`", then you're going to have to show us a minimal, *complete* example. As is, what you show us doesn't *illustrate* a problem, even if you say there is one. – Karl Knechtel Jul 12 '21 at 09:55
  • Not really, unless I have missed something here. I have tried to cast it to `str` when I call `pd.read_json()`. If you run `type(df['owner'].iloc[1, ])` it will return `str`. I will edit my question to make this clear. – Hawklaz Jul 12 '21 at 10:08
  • Again, please show a complete example - see https://stackoverflow.com/help/minimal-reproducible-example for guidance. The output you show can't possibly be a real output; if the value really is a `str` then it would have enclosing quotes when displayed. – Karl Knechtel Jul 13 '21 at 17:29
  • Sorry, I had made a mistake when copying the output code by dropping the double quotes - edited now. However, if you run my code you will see the output type as `str`. My code is however reproducible and you will notice the single quotes within the JSON string if you run it yourself. So as I said, I want to see something like, `"{"foo": "bar"}"` instead of `"{'foo': 'bar'}"` – Hawklaz Jul 14 '21 at 04:59
  • I have 2 questions to better understand the context: in your case single quotes are an issue because you are not able to run `json_normalize()`? Do you need to cast the dataframe to `str` or it was just an attempt to solve the problem? – Giulio Mattolin Jul 17 '21 at 14:45
  • Single quotes are an issue because I expected `pd.read_json(dtype=str)` to import the JSON string as it is (i.e., with double quotes). Yes `json_normalize()` become problematic, but my focus here was to question whether the way pd.read_json() works is correct? Given, IMO it should cast it directly to a `str` without first importing it as `dict`. And I tried casting to `str` because expected the double quotes to remain as in the original JSON string. – Hawklaz Jul 18 '21 at 22:59

1 Answers1

3

pd.read_json() may be importing the nested JSON data as a Python dict before it is cast to str.

You are totally right. The dtype str is applied after the json was loaded as a python data structure by the JsonReader class. What you get is the string representation of a dict and not the raw JSON.

import json

d = {"key": "value"}
>>> d
{'key': 'value'}

>>> str(s)  # dtype=str
"{'key': 'value'}"

>>> json.dumps(d)
'{"key": "value"}'

>>> json.loads(json.dumps(d))
{'key': 'value'}

Another example:

d = {"key1": "value1", "key2": {"key21": "value21"}}
l = [d] * 3
>>> df = pd.read_json(json.dumps(l)); print(df)

     key1                  key2
0  value1  {'key21': 'value21'}
1  value1  {'key21': 'value21'}
2  value1  {'key21': 'value21'}

>>> type(df.iloc[0, 1])
dict

>>> df = pd.read_json(json.dumps(l), dtype={'key2': str}); print(df)

     key1                  key2
0  value1  {'key21': 'value21'}  # same display output
1  value1  {'key21': 'value21'}
2  value1  {'key21': 'value21'}

>>> type(df.iloc[0, 1])
str  # but not the same type

To get double quotes, use pd.read_json without dtype and apply json.dumps to dict columns:

df = pd.read_json(json.dumps(l))
df['key2'] = df['key2'].apply(json.dumps)
>>> df

     key1                  key2
0  value1  {"key21": "value21"}
1  value1  {"key21": "value21"}
2  value1  {"key21": "value21"}
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • Thanks for the answer. Also, is this behaviour correct for `pd.read_json()`? I would have imagined it should not be casting multiple times thereby essentially destroying information (i.e., the double quotes). Would you say this should be changed in some sort of future release? – Hawklaz Jul 18 '21 at 23:02
  • 1
    JSON is not a valid data type for Python, it have to [translate](https://docs.python.org/3/library/json.html#encoders-and-decoders) the whole file in base type (str, dict, list, ...) first to understand the sense of data. For Python, simple or double quotes are not significant, it's just a visual representation and allowing the serialization (pickle). Besides, if your string contains a simple quote, Python prints it with double quotes to enclose the string. I think this behaviour will never be changed. – Corralien Jul 19 '21 at 04:44