The title is horrible, sorry. Here's what I mean. This is the starting dataset:

C1   C2
AA   H
AB   M
AC   M
AA   H
AA   L
AC   L

I want to turn it into a new dataset with 4 columns, counting how often each C2 value occurs for each C1 value:

C1   CH   CM   CL
AA   2    0    1
AB   0    1    0
AC   0    1    1
BryceSoker

1 Answer

You can use the pivot API together with groupBy and agg, as follows:

from pyspark.sql import functions as F
finaldf = df.groupBy("C1").pivot("C2").agg(F.count("C2").alias("count")).na.fill(0)

finaldf should then look like this:

+---+---+---+---+
| C1|  H|  L|  M|
+---+---+---+---+
| AA|  2|  1|  0|
| AB|  0|  0|  1|
| AC|  0|  1|  1|
+---+---+---+---+
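
Note that the pivoted columns are named after the raw C2 values (H, L, M), while the question asked for CH, CM, CL in that order. Here is a minimal sketch of one way to get there, assuming the three levels are known up front. The names `prefixed_names` and `pivot_counts` are hypothetical helpers made up for this illustration, not part of PySpark.

```python
def prefixed_names(columns, key="C1", prefix="C"):
    """Prefix every column name except the grouping key (hypothetical helper)."""
    return [c if c == key else prefix + c for c in columns]


def pivot_counts(df, key="C1", category="C2", levels=("H", "M", "L"), prefix="C"):
    """Count rows per (key, category) pair and spread the categories into columns."""
    # Imported here so the renaming helper stays usable without Spark installed.
    from pyspark.sql import functions as F

    pivoted = (
        df.groupBy(key)
        .pivot(category, list(levels))  # explicit values fix the column order
        .agg(F.count(category))
        .na.fill(0)                     # combinations with no rows come back as null
    )
    # toDF renames all columns positionally, so the key column keeps its name.
    return pivoted.toDF(*prefixed_names(pivoted.columns, key, prefix))
```

Calling `pivot_counts(df)` on the question's DataFrame should give the columns C1, CH, CM, CL. Passing the list of values to pivot also saves Spark the extra pass it would otherwise need to discover the distinct C2 values. The row order of the result is not guaranteed unless you add an orderBy.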
Ramesh Maharjan