I have a simple dataset that looks like:
+-----------+-----+-----+
|destination|delay|month|
+-----------+-----+-----+
|        ORD|   92|    1|
|        JFK|   -7|    1|
|        ORD|   12|    2|
|        JFK|    2|    2|
|        ORD|   32|    3|
|        JFK|   30|    3|
+-----------+-----+-----+
I am looking at different pivot options. With the DataFrame API it is possible to limit the pivot values (method 2 in the code below) or to compute all pivot values dynamically (method 1). In Spark SQL it is possible to limit the pivot values (method 4), but it seems impossible to compute all pivot values dynamically: method 3, which uses a subquery in the IN clause, does not work.
Of course, one could compute the unique months in a separate step and splice them into the query string, but that is verbose and not very elegant.
Is there a good way to make method 3 work?
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, LongType, ByteType

# get a spark session
spark = SparkSession.builder.appName('learn').getOrCreate()
# create dataset
schema = StructType([
    StructField('destination', StringType(), nullable=False),
    StructField('delay', LongType(), nullable=False),
    StructField('month', ByteType(), nullable=False),
])
data = [
    ('ORD', 92, 1),
    ('JFK', -7, 1),
    ('ORD', 12, 2),
    ('JFK', 2, 2),
    ('ORD', 32, 3),
    ('JFK', 30, 3),
]
df = spark.createDataFrame(data, schema=schema)
# pivot (dataframe API), concise but expensive, method 1
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.pivot.html?highlight=pivot
# pivot cannot be lazy!
res = df.groupby('destination').pivot('month').agg(f.sum('delay').alias('total_delay'), f.avg('delay').alias('average_delay'))
res.show()
# pivot (dataframe API), less concise (only includes a few pivot columns), method 2
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.pivot.html?highlight=pivot
# pivot cannot be lazy!
res = df.groupby('destination').pivot('month', [1, 2]).agg(f.sum('delay').alias('total_delay'), f.avg('delay').alias('average_delay'))
res.show()
# pivot (SQL), concise but expensive, method 3 DOES NOT WORK: (SELECT distinct month from tmp_view) is not accepted
# https://spark.apache.org/docs/3.1.2/sql-ref-syntax-qry-select-pivot.html
# pivot cannot be lazy!
df.createOrReplaceTempView('tmp_view')
query = """
SELECT * FROM tmp_view
PIVOT (
    SUM(delay) AS total_delay, AVG(delay) AS average_delay
    FOR month IN (SELECT DISTINCT month FROM tmp_view)
)
"""
res = spark.sql(query)
res.show()
# pivot (SQL), less concise (only includes a few pivot columns), method 4
# https://spark.apache.org/docs/3.1.2/sql-ref-syntax-qry-select-pivot.html
# pivot cannot be lazy!
df.createOrReplaceTempView('tmp_view')
query = """
SELECT * FROM tmp_view
PIVOT (
    SUM(delay) AS total_delay, AVG(delay) AS average_delay
    FOR month IN (1 AS Jan, 2 AS Feb)
)
"""
res = spark.sql(query)
res.show()
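For reference, the verbose workaround mentioned above (collecting the distinct months first and splicing them into the SQL string) can be sketched as follows. `pivot_in_list` and `build_pivot_query` are hypothetical helpers, not Spark APIs, and the Spark call that would supply `months` from `tmp_view` is shown only as a comment:

```python
def pivot_in_list(values):
    # Render pivot values as a comma-separated IN (...) list, e.g. "1, 2, 3".
    return ', '.join(str(v) for v in sorted(values))

def build_pivot_query(view, months):
    # Assemble the PIVOT query with the collected month values spliced in.
    return (
        f"SELECT * FROM {view}\n"
        "PIVOT (\n"
        "    SUM(delay) AS total_delay, AVG(delay) AS average_delay\n"
        f"    FOR month IN ({pivot_in_list(months)})\n"
        ")"
    )

# With a live session, months would come from a separate (eager) step, e.g.:
#   months = [r[0] for r in spark.sql('SELECT DISTINCT month FROM tmp_view').collect()]
months = [1, 2, 3]
print(build_pivot_query('tmp_view', months))
```

This works, but it forces an extra job to collect the distinct values before the query string can even be built, which is exactly the inelegance the question is about.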