This question is not a duplicate of PySpark converting a column of type 'map' to multiple columns in a dataframe
The above question is simply to extract values of the map datatypes, my question below is map and one of the columns in there is an dictionary which is no where related to above question.
I have a DynamoDB json file in ORC format, when i read in the file using a SparkSession its schema is
root
|-- item: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I need to select off of this DataFrame some columns which is easy as i could do it using an from_json() item with column name. But the challenge i am facing is while selecting columns, one of the column is a dictionary "init" if when populated i will have to select column values from that column.
My sample data looks like this.
{sequence_number={"n":"15"}, page_id={"s":"ed04a04f-edef-4d40-9266-eb5025c50eb0"}, init={"s":"[{\"type\":1,\"id\":\"CONTACT_CITIZEN_US\",\"name\":\"CONTACT[CITIZEN]\",\"label\":\"UNITED STATES\",\"checked\":true,\"value\":\"US\",\"labelvisibility\":{\"textColor\":\"rgb(51, 51, 51)\",\"backgroundColor\":\"rgba(255, 255, 255, 1)\",\"boxWidth\":105.0625,\"boxHeight\":20,\"parentsDisplayed\":true,\"childrenDisplayed\":true,\"textSize\":\"14px\"},\"fieldvisibility\":{\"boxWidth\":16,\"boxHeight\":16,\"parentsDisplayed\":true,\"childrenDisplayed\":true},\"element_id\":8,\"label_element_id\":9}]"}, client_time={"n":"1506790579036"}, created={"n":"1506790579.1074"}, http_X-Forwarded-For={"s":"108.212.112.22, 10.3.7.56"}, http_User-Agent={"s":"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"}, http_Content-Length={"n":"1125"}, token={"s":"AB8D6195-C3F6-520F-EB5D-FC1CB9BED582"}, execution_time={"n":"6"}}
That 'init' column can have any number of key values in it.
I have tried using spark SQL and functions like explode and try selecting off the column to extract the values i needed but of no luck.
This is what i need when i select
sequence_number page_id type id
15 ed04a04f-edef-.... 1 CONTACT_CITIZEN_US
If you do an from_json() it will take care of the dynamoDB schema of 's' or 'n' within those column values. I am like stuck here, any help would be greatly appreciated.