I am using a dataset from Stanford (see Dev Set 2.0). The file is in JSON format. When I read it, I get a dictionary, which I converted to a DataFrame:

import json
import pandas as pd

with open("dev-v2.0.json", "r") as json_file:
    json_data = json.load(json_file)

df = pd.DataFrame.from_dict(json_data)
df = df[0:2]  # for this example, only a subset

All the information I need is in the df['data'] column. Each row holds a lot of nested data in this format:

{'title': 'Normans', 'paragraphs': [{'qas': [{'question': 'In what country is Normandy located?', 'id': '56ddde6b9a695914005b9628', 'answers': [{'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}], 'is_impossible': False}, {'question': 'When were the Normans in Normandy?', 'id': '56ddde6b9a695914005b9629', 'answers': [{'text': '10th and 11th centuries', 'answer_start': 94}, {'text': 'in the 10th and 11th centuries', 'answer_start': 87}

I want to extract all the questions and answers from every row of the DataFrame. Ideally, the output looks like this:

Question                                         Answer 
'In what country is Normandy located?'          'France'
'When were the Normans in Normandy?'            'in the 10th and 11th centuries'

Sorry in advance! I have read the 'Good example' post, but I found it hard to produce reproducible data for this example: the data is a dictionary containing a list, whose items are dictionaries that in turn contain further lists and dictionaries. When I use print(df["data"]), only a truncated preview is printed, which does not help to reproduce the problem:

print(df['data'])
0    {'title': 'Normans', 'paragraphs': [{'qas': [{...
1    {'title': 'Computational_complexity_theory', '...
Name: data, dtype: object
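
One way to share a reproducible sample is to serialize a single row with json.dumps instead of printing the Series, since pandas truncates the Series display. This is a minimal sketch: `row` is a hypothetical stand-in for `df["data"][0]`, trimmed to one question from the record shown above.

```python
import json

# Hypothetical stand-in for df["data"][0], trimmed to a single question.
row = {
    "title": "Normans",
    "paragraphs": [
        {
            "qas": [
                {
                    "question": "In what country is Normandy located?",
                    "id": "56ddde6b9a695914005b9628",
                    "answers": [{"text": "France", "answer_start": 159}],
                    "is_impossible": False,
                }
            ]
        }
    ],
}

# Pretty-print the full nested structure; slice the string to keep it short.
preview = json.dumps(row, indent=2)
print(preview[:400])
```

The indented text can be pasted into a question as-is and loaded back with json.loads.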

Many thanks in advance!

R overflow
  • I think this is more of a data quality issue and falls under the umbrella "how to parse a nested JSON file". Have a look at [these](https://stackoverflow.com/questions/19729710/parsing-nested-json-data) types of questions, since you need to parse the JSON file before you load it into a pandas DataFrame. – Erfan Oct 07 '19 at 11:54
  • Take a look at [json_normalize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html). – error Oct 07 '19 at 12:01
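
The json_normalize suggestion could look roughly like the sketch below. In pandas 1.0 and later the function is available as pd.json_normalize. The `json_data` dict is a trimmed, hand-written stand-in for the loaded file (an assumption for illustration), and taking only the first answer text is a choice made for this sketch, not something the dataset prescribes.

```python
import pandas as pd

# Trimmed stand-in for the dict returned by json.load on dev-v2.0.json.
json_data = {
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "In what country is Normandy located?",
                            "id": "56ddde6b9a695914005b9628",
                            "answers": [{"text": "France", "answer_start": 159}],
                            "is_impossible": False,
                        },
                        {
                            "question": "When were the Normans in Normandy?",
                            "id": "56ddde6b9a695914005b9629",
                            "answers": [
                                {"text": "10th and 11th centuries", "answer_start": 94},
                                {"text": "in the 10th and 11th centuries", "answer_start": 87},
                            ],
                            "is_impossible": False,
                        },
                    ]
                }
            ],
        }
    ]
}

# record_path walks data -> paragraphs -> qas; meta carries the article title along.
qas = pd.json_normalize(
    json_data["data"], record_path=["paragraphs", "qas"], meta=["title"]
)

# "answers" remains a list of dicts; keep the first text, or None for an empty list.
qas["answer"] = qas["answers"].apply(lambda a: a[0]["text"] if a else None)
result = qas[["question", "answer"]]
print(result)
```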

2 Answers

This should get you started.

I wasn't sure how to handle cases where the answers field is empty, so you might want to come up with a better solution. Example:

"question": " After 1945, what challenged the British empire?", "id": "5ad032b377cf76001a686e0d", "answers": [], "is_impossible": true

import json
import pandas as pd


with open("dev-v2.0.json", "r") as f:
    data = json.load(f)

questions, answers = [], []

# Walk the nesting: articles -> paragraphs -> question/answer entries
for article in data["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            q = qa["question"]
            try: # only takes first element since the rest of values are duplicated?
                a = qa["answers"][0]["text"]
            except IndexError: # when `"answers": []`
                a = "None"  # placeholder string, not the None object

            questions.append(q)
            answers.append(a)

d = {
    "Questions": questions,
    "Answers": answers
}

pd.DataFrame(d)

                                               Questions                      Answers
0                   In what country is Normandy located?                       France
1                     When were the Normans in Normandy?      10th and 11th centuries
2          From which countries did the Norse originate?  Denmark, Iceland and Norway
3                              Who was the Norse leader?                        Rollo
4      What century did the Normans first gain their ...                 10th century
...                                                  ...                          ...
11868  What is the seldom used force unit equal to on...                       sthène
11869           What does not have a metric counterpart?                         None
11870  What is the force exerted by standard gravity ...                         None
11871  What force leads to a commonly used unit of mass?                         None
11872        What force is part of the modern SI system?                         None

[11873 rows x 2 columns]
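
One possible way to deal with the empty-answers case is to keep every distinct answer text and carry the is_impossible flag into the table rather than storing a placeholder. The sketch below is only one option: `iter_qa_rows`, the `sample` dict, and the title "Example_article" are made up for illustration, and the real file is not read here.

```python
import pandas as pd


def iter_qa_rows(squad):
    """Yield one dict per question from a SQuAD-style dict (hypothetical helper)."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Deduplicate answer texts while preserving their order.
                texts = list(dict.fromkeys(ans["text"] for ans in qa["answers"]))
                yield {
                    "Question": qa["question"],
                    "Answers": texts,
                    "is_impossible": qa.get("is_impossible", False),
                }


# Hand-written sample with the same nesting as the dataset; the second
# article title is invented for this example.
sample = {
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "In what country is Normandy located?",
                            "answers": [
                                {"text": "France", "answer_start": 159},
                                {"text": "France", "answer_start": 159},
                            ],
                            "is_impossible": False,
                        },
                        {
                            "question": "When were the Normans in Normandy?",
                            "answers": [
                                {"text": "10th and 11th centuries", "answer_start": 94},
                                {"text": "in the 10th and 11th centuries", "answer_start": 87},
                            ],
                            "is_impossible": False,
                        },
                    ]
                }
            ],
        },
        {
            "title": "Example_article",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": " After 1945, what challenged the British empire?",
                            "answers": [],
                            "is_impossible": True,
                        }
                    ]
                }
            ],
        },
    ]
}

df = pd.DataFrame(list(iter_qa_rows(sample)))
print(df)
```

Rows with an empty Answers list can then be filtered out with df[~df["is_impossible"]] if only answerable questions are needed.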
help-ukraine-now

The following page, "SQuAD (Stanford Q&A) json to Pandas DataFrame", deals with converting dev-v1.1.json to a DataFrame.

kelidas