My input:
+-------------------+------------------+-----------------+
| TransactionDate|Product Major Code|Gross Trade Sales|
+-------------------+------------------+-----------------+
|2017-09-30 00:00:00| A| 100.0|
|2017-06-30 00:00:00| B| 200.0|
|2017-06-30 00:00:00| C| 300.0|
+-------------------+------------------+-----------------+
My code:
df.registerTempTable("tmp")
df2=spark.sql("SELECT TransactionDate,'Product Major Code', sum('Gross Trade Sales') FROM tmp GROUP BY TransactionDate,'Product Major Code'")
spark.catalog.dropTempView('tmp')
My output:
+-------------------+------------------+--------------------------------------+
| TransactionDate|Product Major Code|sum(CAST(Gross Trade Sales AS DOUBLE))|
+-------------------+------------------+--------------------------------------+
|2017-09-30 00:00:00|Product Major Code| null|
|2017-06-30 00:00:00|Product Major Code| null|
+-------------------+------------------+--------------------------------------+
Does anyone know why it's not correctly aggregating Product Major Code and Gross Trade Sales?
Update:
In the end I went with PaulIT's answer below, as it was more elegant and doesn't require worrying about backticks. The root cause: the single quotes in my query made Spark treat 'Product Major Code' and 'Gross Trade Sales' as string literals rather than column references, which is why every output row shows the literal text and the sum comes back null.
import pyspark.sql.functions as f

trydf.groupBy(f.col("TransactionDate"), f.col("Product Major Code")) \
    .agg(f.sum(f.col("Gross Trade Sales"))) \
    .show()
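For anyone who prefers to stay in SQL: the original query also works once the single quotes are swapped for backticks, because Spark SQL uses backticks to quote identifiers containing spaces, while single quotes create string literals. A sketch of the corrected query, assuming the same df and spark session as above (createOrReplaceTempView replaces the deprecated registerTempTable):

df.createOrReplaceTempView("tmp")
df2 = spark.sql("""
    SELECT TransactionDate,
           `Product Major Code`,
           SUM(`Gross Trade Sales`) AS `Gross Trade Sales`
    FROM tmp
    GROUP BY TransactionDate, `Product Major Code`
""")  # backticks, not single quotes, so the names are treated as columns
spark.catalog.dropTempView("tmp")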