
I have a PySpark DataFrame that contains a date column "Reported Date" (type: string). I would like to get the count of another column after extracting the year from the date.

I can get the count if I use the string date column.

crimeFile_date.groupBy("Reported Date").sum("Offence Count").show()

and I get this output

+-------------+------------------+
|Reported Date|sum(Offence Count)|
+-------------+------------------+
|   13/08/2010|               342|
|    6/10/2011|               334|
|   27/11/2011|               269|
|   12/01/2012|               303|
|   22/02/2012|               286|
|   31/07/2012|               276|
|   25/04/2013|               222|
+-------------+------------------+

To extract the year from "Reported Date", I converted it to a date format (using this approach) and named the new column "Date". However, when I try to use the same code to group by the new column and do the count, I get an error message.

crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()
TypeError: strptime() argument 1 must be str, not None

This is the data schema:

root
 |-- Offence Count: integer (nullable = true)
 |-- Reported Date: string (nullable = true)
 |-- Date: date (nullable = true)

Is there a way to fix this error, or to extract the year using another method? Thank you

leena

1 Answer


If I understand correctly, you want to extract the year from a string date column. One way is using a regex, but that can throw your logic off if the regex doesn't handle all the input formats.

Here is the date data type approach.

Imports

import pyspark.sql.functions as f

Creating your DataFrame

l1 = [('13/08/2010',342),('6/10/2011',334),('27/11/2011',269),('12/01/2012',303),('22/02/2012',286),('31/07/2012',276),('25/04/2013',222)]
dfl1 =  spark.createDataFrame(l1).toDF("dates","sum")

dfl1.show()
+----------+---+
|     dates|sum|
+----------+---+
|13/08/2010|342|
| 6/10/2011|334|
|27/11/2011|269|
|12/01/2012|303|
|22/02/2012|286|
|31/07/2012|276|
|25/04/2013|222|
+----------+---+

Now, you can use the to_timestamp or to_date APIs from the functions package.

dfl2 = dfl1.withColumn('years',f.year(f.to_timestamp('dates', 'dd/MM/yyyy')))

dfl2.show()
+----------+---+-----+
|     dates|sum|years|
+----------+---+-----+
|13/08/2010|342| 2010|
| 6/10/2011|334| 2011|
|27/11/2011|269| 2011|
|12/01/2012|303| 2012|
|22/02/2012|286| 2012|
|31/07/2012|276| 2012|
|25/04/2013|222| 2013|
+----------+---+-----+

Now, group by years.

dfl2.groupBy('years').sum('sum').show()
+-----+--------+                                                                
|years|sum(sum)|
+-----+--------+
| 2013|     222|
| 2012|     865|
| 2010|     342|
| 2011|     603|
+-----+--------+

I am showing this in multiple steps for clarity, but you can combine the year extraction and the group by into one step.

Happy to extend if you need some other help.

SMaZ
  • Thank you for your answer SMaZ. I applied it to my DataFrame and it returned a result similar to yours. The problem now is that the output is showing duplicated years? I am not getting the sum per year. Which is the main goal of this code. – leena Aug 25 '19 at 02:09
  • Do you want to retain your original date column? or just sum at the year? – SMaZ Aug 25 '19 at 02:18
  • Great, Glad could help. Just edited answer too. Happy coding..! – SMaZ Aug 25 '19 at 02:23