Achieve json_normalize in PySpark

Question

I have data in JSON that looks like this:

[{"state": "Florida",
         "shortname": "FL",
         "info": {"governor": "Rick Scott"},
         "counties": [{"name": "Dade",
                       "population": 12345,
                       "Attributes": [
                         {
                          "capture_date": "2020-01-29",
                          "Spirit_code": "TRLQR",
                          "value": 1
                         },
                         {
                            "capture_date": "2020-01-29",
                            "Spirit_code": "HAVPN",
                            "value": 57000
                        }

                       ]},
                       {"name": "Broward",
                        "population": 40000,
                         "Attributes": [
                         {
                            "capture_date": "2020-01-29",
                            "Spirit_code": "GMSTP",
                            "value": 14
                        },
                        {
                            "capture_date": "2020-01-29",
                            "Spirit_code": "GWTPN",
                            "value": 11212
                        }
                       ]
                       },
                       {"name": "Palm Beach",
                        "population": 60000,
                        "Attributes": [{
                            "capture_date": "2020-01-29",
                            "Spirit_code": "YGHMN",
                            "value": 154.01
                        },
                        {
                            "capture_date": "2020-01-29",
                            "Spirit_code": "CXZASD",
                            "value": 154.01
                        }]
                       }
         ]},
        {"state": "Ohio",
         "shortname": "OH",
         "info": {"governor": "John Kasich"},
         "counties": [{"name": "Summit", "population": 1234,
                      "Attributes": [{
                            "capture_date": "2020-01-29",
                            "Spirit_code": "QWERTY",
                            "value": 154.01
                        },
                        {
                            "capture_date": "2020-01-29",
                            "Spirit_code": "JKLGH",
                            "value": 154.01
                        }]
         },
                      {"name": "Cuyahoga", "population": 1337,
                      "Attributes": [{
                            "capture_date": "2020-01-29",
                            "Spirit_code": "ASDF",
                            "value": 154.01
                        },
                        {
                            "capture_date": "2020-01-29",
                            "Spirit_code": "POIUY",
                            "value": 154.01
                        }]

                      }],
        }
]

I can get the result with pandas using:

json_normalize(data["data"], ["counties", "Attributes"], ["state", "shortname", ["counties", "name"], ["counties", "population"]])
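For reference, this call treats `["counties", "Attributes"]` as the record path and copies the listed metadata fields onto every record. A rough plain-Python sketch of that behaviour, using a trimmed sample that is assumed to match what `data["data"]` holds (`flatten_states` is a hypothetical helper name, and the dotted keys are meant to mirror how pandas names nested metadata by default):

```python
# Trimmed sample with the same shape as the data in the question
# (assumed to be what data["data"] holds).
states = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {
                "name": "Dade",
                "population": 12345,
                "Attributes": [
                    {"capture_date": "2020-01-29", "Spirit_code": "TRLQR", "value": 1},
                    {"capture_date": "2020-01-29", "Spirit_code": "HAVPN", "value": 57000},
                ],
            },
            {
                "name": "Broward",
                "population": 40000,
                "Attributes": [
                    {"capture_date": "2020-01-29", "Spirit_code": "GMSTP", "value": 14},
                ],
            },
        ],
    }
]


def flatten_states(states):
    """Hypothetical helper: one output row per entry in counties[*].Attributes."""
    rows = []
    for state in states:
        for county in state["counties"]:
            for attribute in county["Attributes"]:
                row = dict(attribute)  # the record fields come first
                # Metadata copied from the enclosing levels onto each record.
                row["state"] = state["state"]
                row["shortname"] = state["shortname"]
                row["counties.name"] = county["name"]
                row["counties.population"] = county["population"]
                rows.append(row)
    return rows


rows = flatten_states(states)
for row in rows:
    print(row)
```

Each nesting level becomes one loop, which is the same idea as exploding one array column at a time in Spark.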

How can we achieve the result of pandas' json_normalize using PySpark?

The desired output should be in normalized form, as shown below. I can produce it with pandas, but I don't know how to get the same result in PySpark:

state,   shortname, name,       population, attribute.capture_date, attribute.spirit_code, attribute.value
Florida, FL,        Dade,       12345,      2020-01-29,             TRLQR,                 1
Florida, FL,        Dade,       12345,      2020-01-29,             HAVPN,                 57000
Florida, FL,        Broward,    40000,      2020-01-29,             GMSTP,                 14
Florida, FL,        Broward,    40000,      2020-01-29,             GWTPN,                 11212
Florida, FL,        Palm Beach, 60000,      2020-01-29,             YGHMN,                 154.01
Florida, FL,        Palm Beach, 60000,      2020-01-29,             CXZASD,                154.01

What is the input, and what is the desired output? One of these is missing. — michalrudko, Feb 10 '20 at 18:48
I tried to load the above data as JSON, but it seems to be invalid: it returns "unexpected } found". Once I have more time I'll try to find where the issue is; please double-check it on your end as well. — michalrudko, Feb 13 '20 at 10:24
First, make sure that you have a valid JSON string. Use a function like json.dump, which can prettify the JSON string. Then you can use the pandas.json_normalize function. The output of this function can be passed to spark.createDataFrame to get a PySpark DataFrame. Let me know if you have any questions. — ronakvp, Nov 07 '22 at 19:49
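The parse error mentioned in these comments is probably caused by the comma after the `]` that closes Ohio's "counties" list: strict JSON parsers reject a trailing comma, although Python accepts one in a list or dict literal. A minimal sketch of the validation step using only the standard library (`is_valid_json` is a hypothetical helper name, and the two strings are shortened stand-ins for the tail of the sample):

```python
import json

# Shortened stand-ins for the end of the sample data.
broken = '[{"state": "Ohio", "counties": [{"name": "Summit"}],}]'  # trailing comma
fixed = '[{"state": "Ohio", "counties": [{"name": "Summit"}]}]'


def is_valid_json(text):
    """Hypothetical helper: report where parsing fails instead of raising."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        print(f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return False
    return True


print(is_valid_json(broken))
print(is_valid_json(fixed))
```

Once the string parses cleanly, the parsed list can be handed to pandas.json_normalize as described in the comment above.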

0 Answers