
I'm looking for a way to aggregate my data by month. First, I want to keep only the month in my visitdate. My DataFrame looks like this:

Row(visitdate = 1/1/2013,
patientid = P1_Pt1959,
amount = 200,
note = jnut)

My goal is then to group by visitdate and calculate the sum of amount. I tried this:

from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

file_path = "G:/Visit Data.csv"
patients = spark.read.csv(file_path,header = True)
patients.createOrReplaceTempView("visitdate")

sqlDF = spark.sql("SELECT visitdate,SUM(amount) as totalamount from visitdate GROUP BY visitdate")
sqlDF.show()

This is the result :

+----------+-----------+
| visitdate|totalamount|
+----------+-----------+
|  9/1/2013|    10800.0|
|25/04/2013|    12440.0|
|27/03/2014|    16930.0|
|26/03/2015|    18560.0|
|14/05/2013|    13770.0|
|30/06/2013|    13880.0|

My goal is to get something like this:

  visitdate|totalamount|
+----------+-----------+
|1/1/2013|    10800.0|
|1/2/2013|    12440.0|
|1/3/2013|    16930.0|
|1/4/2014|    18560.0|
|1/5/2015|    13770.0|
|1/6/2015|    13880.0|
  • Possible duplicate of [Group spark dataframe by date](https://stackoverflow.com/questions/34946051/group-spark-dataframe-by-date) – LN_P Jul 15 '19 at 13:53

2 Answers


You could parse and reformat visitdate first and then do the group by (note that you must parse the string to a date before formatting it, and sum the amount column, not the month string):

from pyspark.sql import functions as F

(df.withColumn('visitdate_month',
               F.date_format(F.to_date(F.col('visitdate'), 'd/M/yyyy'), '1/M/yyyy'))
   .groupBy('visitdate_month')
   .agg(F.sum('amount').alias('totalamount'))
)
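The month-bucketing logic this answer relies on can be checked in plain Python, without a Spark session. This is only an illustrative sketch: the sample rows below are made up, and a dict stands in for the groupBy/sum.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (visitdate string in d/m/yyyy form, amount)
rows = [
    ("9/1/2013", 200.0),
    ("25/1/2013", 300.0),
    ("27/3/2014", 150.0),
]

totals = defaultdict(float)
for visitdate, amount in rows:
    d = datetime.strptime(visitdate, "%d/%m/%Y")  # parse the day-first date string
    month_key = f"1/{d.month}/{d.year}"           # map every date to day 1 of its month
    totals[month_key] += amount                   # sum amounts per month bucket

print(dict(totals))  # {'1/1/2013': 500.0, '1/3/2014': 150.0}
```

Both dates in January 2013 collapse onto the same `1/1/2013` key, which is exactly what the `date_format(..., '1/M/yyyy')` pattern does in Spark: the literal `1` replaces the day of month.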
  • getting null data in the middle visitdate_month| count| +---------------+------+ | 2014-08-01| 241| | 2015-04-01| 245| | 2014-11-01| 247| | 2014-09-01| 246| | 2013-08-01| 187| | 2015-02-01| 269| | 2014-05-01| 277| | null|204363| | 2015-06-01| 260| | 2015-03-01| 242| | 2013-12-01| 181| –  Sep 29 '18 at 11:22
  • Sorry, you have to convert your string to a date and then to the other string format. See my updated answer. – enys Sep 29 '18 at 11:36
  • this time the entire data is coming back null +---------------+--------------------+ |visitdate_month|sum(visitdate_month)| +---------------+--------------------+ | 1/01/2013| null| +---------------+--------------------+ –  Sep 29 '18 at 11:50
  • I updated my answer. `date_format` can parse a date and then convert it to your format. – enys Sep 29 '18 at 12:16

It looks like the "visitdate" column is of type String. It can be converted to a Timestamp with "unix_timestamp", and then to a String in the required format with "date_format". After that, the grouping can be performed. In Scala:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val visitMonthDf = df.withColumn("visitMonth",
  date_format(unix_timestamp($"visitdate", "dd/MM/yyyy").cast(TimestampType), "1/M/yyyy"))
visitMonthDf.groupBy("visitMonth").agg(sum($"amount"))
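The string → epoch seconds → formatted string chain above can be mirrored in plain Python to verify the pattern logic, since `unix_timestamp` and `date_format` behave like `strptime`/`strftime` round-trips. The sample date is illustrative only.

```python
from datetime import datetime, timezone

# Parse "dd/MM/yyyy" to epoch seconds, as unix_timestamp(..., "dd/MM/yyyy") would
visitdate = "26/03/2015"
epoch = datetime.strptime(visitdate, "%d/%m/%Y").replace(tzinfo=timezone.utc).timestamp()

# Back to a datetime (the cast to TimestampType), then to the "1/M/yyyy" month form
d = datetime.fromtimestamp(epoch, tz=timezone.utc)
month_key = f"1/{d.month}/{d.year}"

print(month_key)  # 1/3/2015
```

As in the Spark pattern "1/M/yyyy", the leading `1` is a literal: every date in March 2015 lands on the key `1/3/2015`.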