The schema of my DataFrame is:
scala> x.printSchema()
root
|-- pangaea_customer_id: string (nullable = true)
|-- persona_model: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- score: double (nullable = true)
| | |-- tag: string (nullable = true)
|-- process_date: string (nullable = true)
Here is an example row from this DataFrame:
x.show(1)
+--------------------+--------------------+-------------+
| pangaea_customer_id| persona_model| process_date|
+--------------------+--------------------+-------------+
|000000E91010441BB...|Map(Tech -> [0.21...|2018-05-16-01|
+--------------------+--------------------+-------------+
I want to create a new DataFrame with two columns: pangaea_customer_id
and its respective score (which is inside the map).
Here is what I have tried so far. I am using this command:
val newDF = x.select(col("pangaea_customer_id"), col("persona_model")("Tech")("score"))
But this only gives the score values whose key is "Tech". I want all the score values for all the customers. What should I replace "Tech" with?
My output is here:
scala> newDF.show(10,false)
+--------------------------------+-------------------------+
|pangaea_customer_id |persona_model[Tech].score|
+--------------------------------+-------------------------+
|000000E91010441BB122402A45D439E7|0.21678 |
|000000FB2B304F60B244FEAFDE932640|null |
|000003E2565A4C88B9DAADDE5B5ADE71|null |
|000009D9D1B3443E95F21C58D708B196|null |
|000009EB8F6C4BFABA730726DCFE1FEE|null |
|0000119D3561461E96F8BA2B9523579A|null |
|00001296DC394AED93A19BBD79A5533C|null |
|000014D91E6D4A44AA98E0118E349A52|null |
|0000156A2B5D4275980AB9FD4F8C9163|null |
|000015EC31FC426E9A5477FE0A857982|1.23 |
+--------------------------------+-------------------------+
It shows a null score for every id whose map does not contain the key "Tech", which makes sense because I typed "Tech" in the command above. But I want all the scores, not the null values.
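To make the desired result concrete, here is a minimal sketch on made-up data. It assumes that Spark's `explode` function can be applied to the map column so that every map entry becomes its own row with `key` and `value` columns. The sample rows and the names `AllScores`, `Persona`, `sample` and `allScores` are hypothetical and only serve as illustration; this is not necessarily the best way to do it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object AllScores {
  // Hypothetical case class mirroring the struct stored as the map value
  case class Persona(score: Double, tag: String)

  // Made-up rows with the same column layout as the schema above
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("A", Map("Tech" -> Persona(0.21678, "t1")), "2018-05-16-01"),
      ("B", Map("Sports" -> Persona(0.5, "t2"), "Tech" -> Persona(1.23, "t3")), "2018-05-16-01")
    ).toDF("pangaea_customer_id", "persona_model", "process_date")
  }

  // explode on a map column emits one row per entry, in columns "key" and "value"
  def allScores(df: DataFrame): DataFrame =
    df.select(col("pangaea_customer_id"), explode(col("persona_model")))
      .select(col("pangaea_customer_id"), col("key"), col("value.score").as("score"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("all-scores").getOrCreate()
    allScores(sample(spark)).show(false)
    spark.stop()
  }
}
```

With this shape, a customer whose map has several keys would appear once per key, so no key has to be hard-coded in the select.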