My input:

+-------------------+------------------+-----------------+
|    TransactionDate|Product Major Code|Gross Trade Sales|
+-------------------+------------------+-----------------+
|2017-09-30 00:00:00|                 A|            100.0|
|2017-06-30 00:00:00|                 B|            200.0|
|2017-06-30 00:00:00|                 C|            300.0|
+-------------------+------------------+-----------------+

My code:

df.registerTempTable("tmp")
df2=spark.sql("SELECT TransactionDate,'Product Major Code', sum('Gross Trade Sales') FROM tmp GROUP BY TransactionDate,'Product Major Code'")
spark.catalog.dropTempView('tmp')

My output:

+-------------------+------------------+--------------------------------------+
|    TransactionDate|Product Major Code|sum(CAST(Gross Trade Sales AS DOUBLE))|
+-------------------+------------------+--------------------------------------+
|2017-09-30 00:00:00|Product Major Code|                                  null|
|2017-06-30 00:00:00|Product Major Code|                                  null|
+-------------------+------------------+--------------------------------------+

Does anyone know why it's not correctly aggregating Product Major Code and Gross Trade Sales?

Update:

In the end I went with pault's answer below, as it is more elegant and avoids having to worry about backticks:

import pyspark.sql.functions as f

df.groupBy(f.col("TransactionDate"), f.col("Product Major Code")) \
  .agg(f.sum(f.col("Gross Trade Sales"))) \
  .show()
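
For reference, here is a sketch of the pure-SQL version of the fix (wrapping the column names in backticks, as suggested in the comments below), assuming the same df and Spark session; createOrReplaceTempView is simply the non-deprecated form of registerTempTable:

df.createOrReplaceTempView("tmp")
df2 = spark.sql("""
    SELECT TransactionDate,
           `Product Major Code`,
           sum(`Gross Trade Sales`) AS `Gross Trade Sales`
    FROM tmp
    GROUP BY TransactionDate, `Product Major Code`
""")
df2.show()
spark.catalog.dropTempView("tmp")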

Reddspark
  • Try wrapping `Gross Trade Sales` and `Product Major Code` in backticks. As written, you are trying to sum the literal string `"Gross Trade Sales"` cast as a double, which is `null`. – pault Dec 28 '18 at 17:22
  • Wow, that actually worked! I would never have guessed that; it's not very python-esque, is it? Either way, post it as an answer so I can award you the points. – Reddspark Dec 28 '18 at 17:25
  • IMO you should accept the duplicate target instead. – pault Dec 28 '18 at 17:27
  • I would also add that you can achieve the same thing from the API functions: first `import pyspark.sql.functions as f` and then try `df.groupBy(f.col("TransactionDate"), f.col("Product Major Code")).agg(f.sum(f.col("Gross Trade Sales"))).show()` – pault Dec 28 '18 at 17:47
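
Expanding on pault's first comment above: in Spark SQL, single quotes create string literals rather than column references, so the original query sums the text 'Gross Trade Sales' cast to a double. A minimal sketch of that behavior, assuming the same df as above:

df.createOrReplaceTempView("tmp")
spark.sql("""
    SELECT sum('Gross Trade Sales') AS literal_sum,  -- string literal cast to DOUBLE, so null
           sum(`Gross Trade Sales`) AS column_sum    -- the actual column, so 600.0
    FROM tmp
""").show()
spark.catalog.dropTempView("tmp")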
