I have a PySpark DataFrame that contains a date column "Reported Date" (type: string). I would like to sum another column ("Offence Count") after extracting the year from the date.
I can get the sum if I group by the string date column directly:
crimeFile_date.groupBy("Reported Date").sum("Offence Count").show()
and I get this output:
+-------------+------------------+
|Reported Date|sum(Offence Count)|
+-------------+------------------+
|   13/08/2010|               342|
|    6/10/2011|               334|
|   27/11/2011|               269|
|   12/01/2012|               303|
|   22/02/2012|               286|
|   31/07/2012|               276|
|   25/04/2013|               222|
+-------------+------------------+
To extract the year from "Reported Date", I converted it to a date type (using this approach) and named the new column "Date". However, when I try the same code to group by the new column and aggregate, I get an error:
from pyspark.sql.functions import year
crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()
TypeError: strptime() argument 1 must be str, not None
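For reference, the conversion step looks roughly like this (a sketch rather than my exact code, since that comes from the linked approach; the parse_date name and the "%d/%m/%Y" format are assumptions, and the strptime call is inferred from the error above):

from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

# Parse "Reported Date" strings like "13/08/2010" into date values.
# strptime raises "argument 1 must be str, not None" when it is handed a
# null value, which is presumably what happens on some rows here.
parse_date = udf(lambda s: datetime.strptime(s, "%d/%m/%Y").date(), DateType())
crimeFile_date = crimeFile_date.withColumn("Date", parse_date("Reported Date"))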
This is the data schema:
root
|-- Offence Count: integer (nullable = true)
|-- Reported Date: string (nullable = true)
|-- Date: date (nullable = true)
Is there a way to fix this error, or to extract the year using another method?
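If it helps, this is the kind of built-in alternative I am wondering about (an untested sketch: to_date with an explicit format string, available in Spark 2.2+, where "d/M/yyyy" is my guess at the format based on the sample dates above):

from pyspark.sql.functions import to_date, year

# Let Spark parse the string natively instead of going through a Python UDF;
# to_date returns null for rows it cannot parse rather than raising an error.
with_date = crimeFile_date.withColumn("Date", to_date("Reported Date", "d/M/yyyy"))
with_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()

Thank you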