
I am using PySpark, which produces nested JSON that looks like this:

{
    "batch_key": 1,
    "client_key": 1,
    "client_name": "ABC",
    "Claims": [
        {
            "claim_key": "A",
            "client_key": "B",
            "client_name": "ATT"
        },
        {
            "claim_key": "B",
            "client_key": "B",
            "client_name": "ATT"
        }
    ]
}

Ideally, it should be split into separate records, one per claim, like below:

{
    "batch_key": 1,
    "client_key": 1,
    "client_name": "ABC",
    "Claims": [
        {
            "claim_key": "A",
            "client_key": "B",
            "client_name": "ATT"
        }
    ]
}

{
    "batch_key": 1,
    "client_key": 1,
    "client_name": "ABC",
    "Claims": [
        {
            "claim_key": "B",
            "client_key": "B",
            "client_name": "ATT"
        }
    ]
}

The actual JSON payload would be much bigger, hence the split above is needed so that the API can consume it properly. Is there a way to achieve this using Spark SQL/PySpark/Python?

Prashant

1 Answer


For each batch record you could extract the claims, map over the claims to create one batch per claim, and then flatten the result. In PySpark, this map-then-flatten combination is exactly what flatMap does on an RDD.

For example, assuming you have an RDD of batches (each batch being a parsed Python dict like the one above):


batches = batches.flatMap(lambda batch: [
    {
        "batch_key": batch["batch_key"],
        "client_key": batch["client_key"],
        "client_name": batch["client_name"],
        "Claims": [claim],
    }
    for claim in batch["Claims"]
])
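As a minimal end-to-end sketch (assuming a local SparkContext and that each batch has already been parsed into a Python dict; the variable names are illustrative), the same flatMap yields one output record per claim:

import json
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical input: a single batch with two claims, as in the question.
batch = {
    "batch_key": 1,
    "client_key": 1,
    "client_name": "ABC",
    "Claims": [
        {"claim_key": "A", "client_key": "B", "client_name": "ATT"},
        {"claim_key": "B", "client_key": "B", "client_name": "ATT"},
    ],
}

batches = sc.parallelize([batch])

# One record per claim, carrying over the batch-level fields.
split_batches = batches.flatMap(lambda b: [
    {
        "batch_key": b["batch_key"],
        "client_key": b["client_key"],
        "client_name": b["client_name"],
        "Claims": [claim],
    }
    for claim in b["Claims"]
])

for record in split_batches.collect():
    print(json.dumps(record, indent=4))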

Depending on your Python version and the number of attributes/keys in each JSON record, you may consider different options for merging or creating the new dictionary inside the claims map - see Merging dictionaries in Python 2 and 3.
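For instance, on Python 3.5+ you can avoid listing every batch-level key by unpacking the batch dict and overriding only the Claims field (a small sketch; key names follow the question's JSON):

split_batches = batches.flatMap(lambda b: [
    # Copy all batch-level keys, then replace "Claims" with a single-claim list.
    {**b, "Claims": [claim]}
    for claim in b["Claims"]
])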

ggordon