My input:

+-------------------+------------------+-----------------+
|    TransactionDate|Product Major Code|Gross Trade Sales|
+-------------------+------------------+-----------------+
|2017-09-30 00:00:00|                 A|            100.0|
|2017-06-30 00:00:00|                 B|            200.0|
|2017-06-30 00:00:00|                 C|            300.0|
+-------------------+------------------+-----------------+

My code:

df.registerTempTable("tmp")
df2=spark.sql("SELECT TransactionDate,'Product Major Code', sum('Gross Trade Sales') FROM tmp GROUP BY TransactionDate,'Product Major Code'")
spark.catalog.dropTempView('tmp')

My output:

+-------------------+------------------+--------------------------------------+
|    TransactionDate|Product Major Code|sum(CAST(Gross Trade Sales AS DOUBLE))|
+-------------------+------------------+--------------------------------------+
|2017-09-30 00:00:00|Product Major Code|                                  null|
|2017-06-30 00:00:00|Product Major Code|                                  null|
+-------------------+------------------+--------------------------------------+

Does anyone know why it's not correctly aggregating Product Major Code and Gross Trade Sales?

Update:

In the end I went with pault's answer below, as it is more elegant and avoids having to worry about backticks:

import pyspark.sql.functions as f

df.groupBy(f.col("TransactionDate"), f.col("Product Major Code")) \
  .agg(f.sum(f.col("Gross Trade Sales"))) \
  .show()
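
For reference, here is a sketch of the pure-SQL version of the fix (wrapping the column names in backticks, as suggested in the comments below), assuming the same df and Spark session; createOrReplaceTempView is simply the non-deprecated form of registerTempTable:

df.createOrReplaceTempView("tmp")
df2 = spark.sql("""
    SELECT TransactionDate,
           `Product Major Code`,
           sum(`Gross Trade Sales`) AS `Gross Trade Sales`
    FROM tmp
    GROUP BY TransactionDate, `Product Major Code`
""")
df2.show()
spark.catalog.dropTempView("tmp")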

Reddspark
  • Try wrapping `Gross Trade Sales` and `Product Major Code` in backticks. As written, you are trying to sum the literal string `"Gross Trade Sales"` cast as a double, which is `null`. – pault Dec 28 '18 at 17:22
  • Wow, that actually worked! I would never have guessed that; it's not very python-esque, is it? Either way, post it as an answer so I can award you the points. – Reddspark Dec 28 '18 at 17:25
  • IMO you should accept the duplicate target instead. – pault Dec 28 '18 at 17:27
  • I would also add that you can achieve the same thing from the API functions: first `import pyspark.sql.functions as f` and then try `df.groupBy(f.col("TransactionDate"), f.col("Product Major Code")).agg(f.sum(f.col("Gross Trade Sales"))).show()` – pault Dec 28 '18 at 17:47
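
Expanding on pault's first comment above: in Spark SQL, single quotes create string literals rather than column references, so the original query sums the text 'Gross Trade Sales' cast to a double. A minimal sketch of that behavior, assuming the same df as above:

df.createOrReplaceTempView("tmp")
spark.sql("""
    SELECT sum('Gross Trade Sales') AS literal_sum,  -- string literal cast to DOUBLE, so null
           sum(`Gross Trade Sales`) AS column_sum    -- the actual column, so 600.0
    FROM tmp
""").show()
spark.catalog.dropTempView("tmp")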
