I am currently trying to do some aggregation on the Services column. I would like to group all the similar services, sum their values, and, if possible, flatten the result into a single row per cid.
Input:
+------------------+--------------------+
| cid | Services|
+------------------+--------------------+
|845124826013182686| [112931, serv1]|
|845124826013182686| [146936, serv1]|
|845124826013182686| [32718, serv2]|
|845124826013182686| [28839, serv2]|
|845124826013182686| [8710, serv2]|
|845124826013182686| [2093140, serv3]|
Hopeful Output:
+------------------+--------------------+------------------+--------------------+
| cid | serv1 | serv2 | serv3 |
+------------------+--------------------+------------------+--------------------+
|845124826013182686| 259867 | 70267 | 2093140 |
Below is the code I currently have:
from pyspark.sql import SparkSession, functions

spark = SparkSession.builder.appName("Service Aggregation").getOrCreate()

pathToFile = '/path/to/jsonfile'
df = spark.read.json(pathToFile)

# First explode: one row per entry of the nodes.services array
df2 = df.select('cid', functions.explode_outer(df.nodes.services))
# Second explode: one row per individual service struct
finaldataFrame = df2.select('cid', functions.explode_outer(df2.col).alias('Services'))
finaldataFrame.show()
I am quite new to PySpark. I have been looking at resources and trying to create a UDF to apply to that column, but the map function in PySpark only works for RDDs, not DataFrames, and I am unsure how to move forward to get the desired output.
Any suggestions or help would be much appreciated.
Result of printSchema():
root
|-- clusterId: string (nullable = true)
|-- col: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cpuCoreInSeconds: long (nullable = true)
| | |-- name: string (nullable = true)