How to encode the grouped data in spark.dataframe?

Question

I have the data like that in a spark.dataframe A :

Ben       1
Ben       2
Ben       4
Ben       3
Jerry     2
Jerry     2
Jane      3
Jane      5
James     1
James     1

We have the Action_id range 1-5. We want to get the spark.dataframe B like that:

Name     Action_id=1    Action_id=2   Action_id=3  Action_id=4  Action_id=5
Ben          1              1            1               1            0
Jane         0              0            1               0            1
Jerry        0              2            0               0            0            
James        2              0            0               0            0

For example, the '1' in (Ben,Action_id=1) means that in the previous dataframe, Ben take action 1 for one time.

How can I transform the dataframe A to dataframe B ?

Please check this https://stackoverflow.com/a/37487099/11190218 — Prathik Kini, Aug 07 '19 at 09:36

score 0 · Accepted Answer · answered Aug 07 '19 at 09:31

You are looking for a PivotTable using Count aggregation:

In Scala :

import org.apache.spark.sql.{functions => F}

val df = Seq(("Ben", 1),
("Ben", 2),
("Ben", 4),
("Ben", 3),
("Jerry", 2),
("Jerry", 2),
("Jane", 3),
("Jane", 5),
("James", 1),
("James", 1)).toDF("Name", "Action_id")

df.groupBy("Name").pivot("Action_id").agg(F.count("Action_id")).na.fill(0).show

I do not have access to pyspark shell right now, but this should be something like this :

import pyspark.sql.functions as F

(df
    .groupby(df.Name)
    .pivot("Action_id")
    .agg(F.count("Action_id"))
    .na.fill(0)
    .show())

How to encode the grouped data in spark.dataframe?

1 Answers1