Extract Pyspark Dataframe's Column value with list of Dictionary

Question

This question is not a duplicate of PySpark converting a column of type 'map' to multiple columns in a dataframe

The above question is simply to extract values of the map datatypes, my question below is map and one of the columns in there is an dictionary which is no where related to above question.

I have a DynamoDB json file in ORC format, when i read in the file using a SparkSession its schema is

root
 |-- item: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

I need to select off of this DataFrame some columns which is easy as i could do it using an from_json() item with column name. But the challenge i am facing is while selecting columns, one of the column is a dictionary "init" if when populated i will have to select column values from that column.

My sample data looks like this.

{sequence_number={"n":"15"}, page_id={"s":"ed04a04f-edef-4d40-9266-eb5025c50eb0"}, init={"s":"[{\"type\":1,\"id\":\"CONTACT_CITIZEN_US\",\"name\":\"CONTACT[CITIZEN]\",\"label\":\"UNITED STATES\",\"checked\":true,\"value\":\"US\",\"labelvisibility\":{\"textColor\":\"rgb(51, 51, 51)\",\"backgroundColor\":\"rgba(255, 255, 255, 1)\",\"boxWidth\":105.0625,\"boxHeight\":20,\"parentsDisplayed\":true,\"childrenDisplayed\":true,\"textSize\":\"14px\"},\"fieldvisibility\":{\"boxWidth\":16,\"boxHeight\":16,\"parentsDisplayed\":true,\"childrenDisplayed\":true},\"element_id\":8,\"label_element_id\":9}]"}, client_time={"n":"1506790579036"}, created={"n":"1506790579.1074"}, http_X-Forwarded-For={"s":"108.212.112.22, 10.3.7.56"}, http_User-Agent={"s":"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"}, http_Content-Length={"n":"1125"}, token={"s":"AB8D6195-C3F6-520F-EB5D-FC1CB9BED582"}, execution_time={"n":"6"}}

That 'init' column can have any number of key values in it.

I have tried using spark SQL and functions like explode and try selecting off the column to extract the values i needed but of no luck.

This is what i need when i select

sequence_number  page_id               type  id
15             ed04a04f-edef-....       1     CONTACT_CITIZEN_US

If you do an from_json() it will take care of the dynamoDB schema of 's' or 'n' within those column values. I am like stuck here, any help would be greatly appreciated.

@user6910411 I just looked at the link you have above but this question has more depth to it, its not same as the above question, please read my comments after the edit. — Krish, Nov 20 '17 at 16:48

Extract Pyspark Dataframe's Column value with list of Dictionary

0 Answers0