1

Many questions and answers explain how to load a json file from a URL into a pandas data frame, using low level methods for example JSON to pandas DataFrame . I managed to make it work but if possible, I would prefer to use the simple pandas.io.json method to load the data frame directly from a url. The problem is that pandas.io.json.read_json returns a ValueError, which other answers such as ValueError: Expected object or value when reading json as pandas dataframe say can be fixed by setting the orient argument, but what I would need is a different argument that would make it possible to specify that the data frame is one level down in the nested json at the "results" level.

# Sample content of the json file
# {
#     "more": false,
#     "results": [
#         {
#             "id": "all",
#             "text": "All"
#         },
#         {
#             "ID":"4",
#             "text": "Afghanistan"
#         },
#         {
#             "id": "975",
#             "text": "ASEAN"
#         }
#     ]
# }

import urllib.request
import json
import pandas

url_reporter = "https://comtrade.un.org/Data/cache/reporterAreas.json"
url_partner = "https://comtrade.un.org/Data/cache/partnerAreas.json"

# Failed attempt with pandas.io
pandas.io.json.read_json(url_reporter)

# ~/.local/lib/python3.7/site-packages/pandas/io/json/_json.py in _parse_no_numpy(self)
#    1138         if orient == "columns":
#    1139             self.obj = DataFrame(
# -> 1140                 loads(json, precise_float=self.precise_float), dtype=None
#    1141             )
#    1142         elif orient == "split":
#
# ValueError: Expected object or value

pandas.io.json.read_json(url_partner)
# Returns a fattened data frame with json as the cell content in a "results" column

# Successful attempt using lower level packages
req = urllib.request.Request(url=url_reporter)
with urllib.request.urlopen(req) as response:
    json_content = json.load(response)
    df = pandas.json_normalize(json_content['results'])


In [34]: df
Out[34]:
      id                      text
0    all                       All
1      0                     World
2      4               Afghanistan
3    472  Africa CAMEU region, nes
4      8                   Albania
..   ...                       ...
289  879         Western Asia, nes
290  732            Western Sahara
291  887                     Yemen
292  894                    Zambia
293  716                  Zimbabwe

[294 rows x 2 columns]
Paul Rougieux
  • 10,289
  • 4
  • 68
  • 110

1 Answers1

1

There is a BOM character in url_reporter response. Set the encoding like this and you should get your result:

df = pd.io.json.read_json(url_reporter, encoding='utf-8-sig')

Since the whole processing is done in the pd.io.json.read_json method, we cannot select the keys to direct to the actual data that we are after. So you need to run this additional code to get your desired results:

df = pd.json_normalize(df.results.to_list())
Babak Fi Foo
  • 926
  • 7
  • 17
  • Thank you that fixes the read issue, but the data frame is still formatted with a `more` and a `results` column. `results` is a character column containing `{'id': '4', 'text': 'Afghanistan'}`. Is there a way to tell `read_json` to load the `id` and `text` columns directly instead? – Paul Rougieux Aug 31 '21 at 10:43
  • 1
    From URL side there is nothing that could be done. However, after importing the data you can clean it up. I will add the post processing to the solution. @PaulRougieux You can find the solution now. – Babak Fi Foo Aug 31 '21 at 13:37