0

Converting structType to MapType in Spark.

Schema:

event: struct (nullable = true)
|    | event_category: string (nullable = true)
|    | event_name: string (nullable = true)
|    | properties: struct (nullable = true)
|    |    | prop1: string (nullable = true)
|    |    | prop2: string (nullable = true)

Sample data:

{ "event": {
     "event_category: "abc",
      "event_name": "click",
      "properties" : {
          "prop1": "prop1Value",
          "prop2": "prop2Value",
          ....
      }
   } 
}

Need values as:

event_category | event_name | properties_key | properties_value | 
abc            | click      | prop1          | prop1Value
abc            | click      | prop2          | prop2Value
zero323
  • 322,348
  • 103
  • 959
  • 935
shiva.n404
  • 463
  • 1
  • 7
  • 18
  • 1
    Duplicate of [Transpose column to row with Spark](https://stackoverflow.com/q/37864222/6910411) and [Pandas Melt function in Apache Spark](https://stackoverflow.com/q/41670103/6910411) and [unpivot in spark-sql/pyspark](https://stackoverflow.com/q/42465568/6910411). Also duplicate of [your previous question](https://stackoverflow.com/q/48315442/6910411). – zero323 Jan 18 '18 at 22:45
  • 1
    Also please don't post random tags - normally you either use Scala or PySpark / Python. – zero323 Jan 18 '18 at 22:50

1 Answers1

0

You will have to find some mechanism to create map of properties struct. I have used udf function to zip the key and values and return arrays of key and value.

import org.apache.spark.sql.functions._
def collectUdf = udf((cols: collection.mutable.WrappedArray[String], values: collection.mutable.WrappedArray[String]) => cols.zip(values))

Multiple generators are not supported in spark so you will have to save the dataframe to temporary dataframe.

val columnsMap = df_json.select($"event.properties.*").columns
val temp = df_json.withColumn("event_properties", explode(collectUdf(lit(columnsMap), array($"event.properties.*"))))

The last step would be to just separate the event_properties column

temp.select($"event.event_category", $"event.event_name", $"event_properties._1".as("properties_key"), $"event_properties._2".as("properties_value")).show(false)

You should have what you desire

+--------------+----------+--------------+----------------+
|event_category|event_name|properties_key|properties_value|
+--------------+----------+--------------+----------------+
|abc           |click     |prop1         |prop1Value      |
|abc           |click     |prop2         |prop2Value      |
+--------------+----------+--------------+----------------+
Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97