I have some JSON that's being read from a file, where each row looks something like this:
{
  "id": "someGuid",
  "data": {
    "id": "someGuid",
    "data": {
      "players": {
        "player_1": {
          "id": "player_1",
          "locationId": "someGuid",
          "name": "someName",
          "assets": {
            "assetId1": {
              "isActive": true,
              "playlists": {
                "someId1": true,
                "someOtherId1": false
              }
            },
            "assetId2": {
              "isActive": true,
              "playlists": {
                "someId1": true
              }
            }
          }
        },
        "player_2": {
          "id": "player_2",
          "locationId": "someGuid",
          "name": "someName",
          "assets": {
            "assetId3": {
              "isActive": true,
              "playlists": {
                "someId1": true,
                "someOtherId1": false
              }
            },
            "assetId4": {
              "isActive": true,
              "playlists": {
                "someId1": true
              }
            }
          }
        }
      }
    },
    "lastRefreshed": "2020-01-23T19:29:15.6354794Z",
    "expiresAt": "9999-12-31T23:59:59.9999999",
    "dataSourceId": "someId"
  }
}
I'm having difficulty figuring out a way, using Python or SQL in PySpark on Azure Databricks, to turn this JSON into a tabular format like this:
+===========+=============+===============+===========+==============+=============+=================+
| Location | Player_ID | Player | Asset_ID | Asset_Active | Playlist_ID | Playlist_Status |
+===========+=============+===============+===========+==============+=============+=================+
| someId | player_1 | ThisIsAPlayer | anotherId | TRUE | someOtherId | FALSE |
+-----------+-------------+---------------+-----------+--------------+-------------+-----------------+
The challenge is transforming the players property into multiple rows per location. A location may have any number of players, with varying IDs. I probably wouldn't be asking this question if players were an array of player objects instead of a dictionary keyed by player ID, but I have no control over the structure of this document, so it's what I must work with. This is a non-issue in something like PowerBI, where manipulating the data is more straightforward.
The farthest I've been able to get is doing something like this:
df = spark.read.json(filePath).select(
    "data.id",
    "data.lastRefreshed",
    "data.expiresAt",
    "data.dataSourceId",
    "data.data.players.*"
)
But this results in a dataframe/table that expands all the nested structs underneath players into columns. I've scoured SO looking for someone with a similar situation, but no luck.
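To illustrate, printing the schema of that dataframe gives me roughly this (abbreviated; the player structs nest further through assets and playlists):

df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- lastRefreshed: string (nullable = true)
#  |-- expiresAt: string (nullable = true)
#  |-- dataSourceId: string (nullable = true)
#  |-- player_1: struct (nullable = true)
#  |-- player_2: struct (nullable = true)

So every player becomes its own column, when what I want is a row per player (and ultimately a row per playlist entry, as in the table above).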
How do I go about exploding/expanding the players column in this dataframe into separate rows? For reference, I'm using PySpark on Spark 2.4.3.
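For what it's worth, I've also experimented with round-tripping the players struct through JSON to turn it into a map, since explode can handle maps. Something along these lines (just a sketch; the player schema below is a placeholder, and I haven't worked out how to carry the same trick through the nested assets and playlists levels):

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, StructType, StructField

# Placeholder schema for a single player value; the real one would also
# need to describe the nested assets/playlists dictionaries.
player_schema = StructType([
    StructField("id", StringType()),
    StructField("locationId", StringType()),
    StructField("name", StringType()),
])

df = spark.read.json(filePath)

players = df.select(
    F.col("data.id").alias("Location"),
    # to_json/from_json converts the struct (fixed field names) into a
    # map (arbitrary keys); exploding a map yields one key/value row
    # per entry.
    F.explode(
        F.from_json(F.to_json("data.data.players"),
                    MapType(StringType(), player_schema))
    ).alias("Player_ID", "player")
)

Is something like this the right direction, or is there a cleaner way to go from a struct with dynamic keys to rows?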