I have region-wise customer source data like below:

region,source,consumer_id
APAC,mail,1
APAC,referral,2
APAC,mail,3
APAC,referral,5
APAC,mail,6
APAC,referral,7
APAC,referral,8
US East,mail,9
US East,referral,10
US East,walkIn,11
AUS,walkIn,12
AUS,referral,13

Can someone help me get the region-wise source counts like below using PySpark DataFrames?

region,mail_source_cnt,referral_source_cnt,walkIn_source_cnt
APAC,3,4,0
US East,1,1,1
AUS,0,1,1

Thanks for the help.

Alper t. Turker
Anil

1 Answer

You can aggregate to get the per-source counts, then pivot those counts into columns:

>>> from pyspark.sql import functions as F
>>> df.show()
+-------+--------+-----------+
| region|  source|consumer_id|
+-------+--------+-----------+
|   APAC|    mail|          1|
|   APAC|referral|          2|
|   APAC|    mail|          3|
|   APAC|referral|          5|
|   APAC|    mail|          6|
|   APAC|referral|          7|
|   APAC|referral|          8|
|US East|    mail|          9|
|US East|referral|         10|
|US East|  walkIn|         11|
|    AUS|  walkIn|         12|
|    AUS|referral|         13|
+-------+--------+-----------+
>>> df1 = df.groupby('region','source').count()
>>> df1.show()
+-------+--------+-----+
| region|  source|count|
+-------+--------+-----+
|US East|  walkIn|    1|
|    AUS|  walkIn|    1|
|   APAC|    mail|    3|
|   APAC|referral|    4|
|US East|    mail|    1|
|US East|referral|    1|
|    AUS|referral|    1|
+-------+--------+-----+
>>> df2 = df1.withColumn('ccol', F.concat(df1['source'], F.lit('_source_cnt'))).groupby('region').pivot('ccol').agg(F.first('count')).fillna(0)
>>> df2.show()
+-------+---------------+-------------------+-----------------+
| region|mail_source_cnt|referral_source_cnt|walkIn_source_cnt|
+-------+---------------+-------------------+-----------------+
|    AUS|              0|                  1|                1|
|   APAC|              3|                  4|                0|
|US East|              1|                  1|                1|
+-------+---------------+-------------------+-----------------+
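If it helps to see what the two steps compute, the same aggregate-then-pivot logic can be sketched in plain Python (just an illustration of the semantics on the sample data; the variable names here are made up, not part of any Spark API):

```python
from collections import Counter

# Sample (region, source) pairs from the question's data.
rows = [
    ("APAC", "mail"), ("APAC", "referral"), ("APAC", "mail"),
    ("APAC", "referral"), ("APAC", "mail"), ("APAC", "referral"),
    ("APAC", "referral"), ("US East", "mail"), ("US East", "referral"),
    ("US East", "walkIn"), ("AUS", "walkIn"), ("AUS", "referral"),
]

# Step 1: count per (region, source) -- the df.groupby('region','source').count() step.
counts = Counter(rows)

# Step 2: pivot each source into its own column, filling missing
# (region, source) combinations with 0 -- the pivot(...).fillna(0) step.
sources = sorted({s for _, s in rows})
pivoted = {
    region: {f"{s}_cnt": counts.get((region, s), 0) for s in sources}
    for region in {r for r, _ in rows}
}
```

After running this, `pivoted["APAC"]` holds `{"mail_cnt": 3, "referral_cnt": 4, "walkIn_cnt": 0}`, matching the APAC row of the pivoted DataFrame above.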
Suresh