Read nested JSON into Dask DataFrame

Question

I am trying to read nested JSON into a Dask DataFrame, preferably with code that'll do the heavy lifting.

Here's the JSON file I am reading:

{
    "data": [{
        "name": "george",
        "age": 16,
        "exams": [{
                "subject": "geometry",
                "score": 56
            },
            {
                "subject": "poetry",
                "score": 88
            }
        ]

    }, {
        "name": "nora",
        "age": 7,
        "exams": [{
                "subject": "geometry",
                "score": 87
            },
            {
                "subject": "poetry",
                "score": 94
            }
        ]
    }]
}

Here is the resulting DataFrame I would like.

name	age	exam_subject	exam_score
george	16	geometry	56
george	16	poetry	88
nora	7	geometry	87
nora	7	poetry	94

Here's how I'd accomplish this with pandas:

df = pd.read_json("students3.json", orient="split")
exploded = df.explode("exams")
pd.concat([exploded[["name", "age"]].reset_index(drop=True), pd.json_normalize(exploded["exams"])], axis=1)

Dask doesn't have json_normalize, so what's the best way to accomplish this task?

score 2 · Answer 1 · answered Apr 13 '22 at 05:54

If the file contains json-lines, then the most scale-able approach is to use dask.bag and then map the pandas snippet across each bag partition.

If the file is a large json, then the opening/ending brackets will cause problems, so an additional function will be needed to remove them before mapping the text into json.

Rough pseudo-code:

import dask.bag as db

bag = db.read_text("students3.json")

# if there are json-lines 
option1 = bag.map(json.loads).map(pandas_fn)

# if there is a single json
option2 = bag.map(convert_to_jsonlines).map(json.loads).map(pandas_fn)

I couldn't quite get this working, probably cause I'm unfamiliar with the bag API. Take a look at my answer and let me know what you think! — Powers, Apr 14 '22 at 22:08

score 2 · Answer 2 · answered Apr 13 '22 at 06:00

2

Use pd.json_normalize

import json
import pandas as pd

with open('students3.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.json_normalize(data['data'], record_path='exams', meta=['name', 'age'])

    subject  score    name age
0  geometry     56  george  16
1    poetry     88  george  16
2  geometry     87    nora   7
3    poetry     94    nora   7

answered Apr 13 '22 at 06:00

Ynjxsjmh

28,441
6
34
52

Thanks, but how can I get this to work with Dask? – Powers Apr 14 '22 at 17:45
@Powers Sorry, didn't notice that it is dask, how about https://stackoverflow.com/q/39721800/10315163? – Ynjxsjmh Apr 14 '22 at 17:47

score 1 · Answer 3 · answered Apr 13 '22 at 23:46

Pydantic offers excellent JSON validation and ingest. Several Pydantic models (one of each 'top level' JSON entry) can be converted to Python dictionaries in a loop to create a list of dictionaries, type: List[Dict], which may be converted to DataFrame objects.

score 1 · Answer 4 · answered Apr 14 '22 at 22:07

I was inspired by the other answers to come up with this solution.

ddf = dd.read_json("students3.json", orient="split")

def pandas_fn(df):
    exploded = df.explode("exams")
    return pd.concat(
        [
            exploded[["name", "age"]].reset_index(drop=True),
            pd.json_normalize(exploded["exams"]),
        ],
        axis=1,
    )

res = ddf.map_partitions(
    lambda df: pandas_fn(df),
    meta=(
        ("name", "object"),
        ("age", "int64"),
        ("subject", "object"),
        ("score", "int64"),
    ),
)

print(res.compute()) gives this output:

     name  age   subject  score
0  george   16  geometry     56
1  george   16    poetry     88
2    nora    7  geometry     87
3    nora    7    poetry     94

This looks good! My only reservation: this will be memory intensive for large json files (forcing one partition per file, see [docs](https://docs.dask.org/en/stable/generated/dask.dataframe.read_json.html#dask.dataframe.read_json)). So if you have a lot of small files, then your approach will work great, if the files are larger than memory, it might not work... The bag approach gives finer-grain control over size of partitions, but requires a bit of a hack to turn json into json-lines... — SultanOrazbayev, Apr 14 '22 at 22:45

Read nested JSON into Dask DataFrame

4 Answers4