
I have a simple dataset that looks like:

+-----------+-----+-----+
|destination|delay|month|
+-----------+-----+-----+
|        ORD|   92|    1|
|        JFK|   -7|    1|
|        ORD|   12|    2|
|        JFK|    2|    2|
|        ORD|   32|    3|
|        JFK|   30|    3|
+-----------+-----+-----+

I am looking at different pivot options. With the DataFrame API it is possible to limit the pivot values (method 2 in the code below) or to compute all pivot values dynamically (method 1). In Spark SQL it is possible to limit the pivot values (method 4), but it does not seem possible to compute all pivot values dynamically (method 3 does not work).

Of course, one could compute the unique months in a separate step and splice them into the query string (a sketch of that workaround is shown below), but that does not look very elegant and is verbose.
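For illustration, a minimal sketch of that workaround, reusing the same spark, df and tmp_view names as in the code further below (the months and in_list variables are only for this sketch):

# workaround for method 3: collect the distinct months first,
# then splice them into the SQL string by hand
df.createOrReplaceTempView('tmp_view')
months = sorted(row['month'] for row in spark.sql('SELECT DISTINCT month FROM tmp_view').collect())
in_list = ', '.join(str(m) for m in months)
query = f"""
SELECT * FROM tmp_view
    PIVOT (
        SUM(delay) AS total_delay, AVG(delay) AS average_delay
        FOR month IN ({in_list})
    )
"""
res = spark.sql(query)
res.show()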

Is there a good way to make method 3 work?

# imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, LongType, ByteType

# get a spark session
spark = SparkSession.builder.appName('learn').getOrCreate()

# create dataset
schema = StructType([StructField('destination', StringType(), nullable=False),
                     StructField('delay', LongType(), nullable=False),
                     StructField('month', ByteType(), nullable=False)
                     ])
data = [
    ('ORD',  92, 1),
    ('JFK',  -7, 1),
    ('ORD',  12, 2),
    ('JFK',   2, 2),
    ('ORD',  32, 3),
    ('JFK',  30, 3)
]
df = spark.createDataFrame(data, schema=schema)

# pivot (dataframe API), concise but expensive, method 1
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.pivot.html?highlight=pivot
# pivot cannot be lazy!
res = df.groupby('destination').pivot('month').agg(f.sum('delay').alias('total_delay'), f.avg('delay').alias('average_delay'))


# pivot (dataframe API), less concise (only includes a few pivot columns), method 2
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.pivot.html?highlight=pivot
# pivot cannot be lazy!
res = df.groupby('destination').pivot('month', [1,2]).agg(f.sum('delay').alias('total_delay'), f.avg('delay').alias('average_delay'))
res.show()

# pivot (SQL), concise but expensive, method 3 DOES NOT WORK: (SELECT distinct month from tmp_view) is not accepted
# https://spark.apache.org/docs/3.1.2/sql-ref-syntax-qry-select-pivot.html
# pivot cannot be lazy!
df.createOrReplaceTempView('tmp_view')
query = """
SELECT * FROM tmp_view
    PIVOT (
        SUM(delay) AS total_delay, AVG(delay) AS average_delay
        FOR month IN (SELECT distinct month from tmp_view)
    )
"""
res = spark.sql(query)
res.show()

# pivot (SQL), less concise (only includes a few pivot columns), method 4
# https://spark.apache.org/docs/3.1.2/sql-ref-syntax-qry-select-pivot.html
# pivot cannot be lazy!
df.createOrReplaceTempView('tmp_view')
query = """
SELECT * FROM tmp_view
    PIVOT (
        SUM(delay) AS total_delay, AVG(delay) AS average_delay
        FOR month IN (1 AS Jan, 2 AS Feb)
    )
"""
res = spark.sql(query)
  • I found this [post](https://stackoverflow.com/questions/30244910/how-to-pivot-spark-dataframe); people are asking the same question in the comments. I couldn't find any solution to make "method 3" work, but that does not mean one does not exist. Just so you know, when you do not specify the list, as in "method 1", Spark will first collect the distinct values and then perform the pivot. Methods 2 and 4 avoid this pre-pivot computation time. Knowing that, collecting the data first and adding them to the query string is probably the only solution. – Steven Jul 26 '21 at 12:27
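For reference, a minimal sketch of that idea on the DataFrame side (the months variable is only illustrative): collecting the distinct months once and passing them to pivot() explicitly avoids the extra pre-pivot job that method 1 triggers.

# collect the distinct pivot values once, then pass them to pivot() explicitly
months = [row['month'] for row in df.select('month').distinct().collect()]
res = df.groupby('destination').pivot('month', months).agg(
    f.sum('delay').alias('total_delay'),
    f.avg('delay').alias('average_delay'))
res.show()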
