
I'm new to PySpark and only know the most basic operations, and my English is not very good, so I can't describe this in much detail — here is a sample instead. Thanks for your answers!

  • I have a dataframe like this:

| name |    id | flag  | cnt |
|------|-------|-------|-----|
| li   | 19196 | true  |  10 |
| li   | 19196 | false |  15 |
  • I want to convert it to:

| name |    id | flag_true | flag_false |
|------|-------|-----------|------------|
| li   | 19196 |        10 |         15 |
catbrain

1 Answer


You can use a pivot for that (note the `pyspark.sql.functions` import, aliased as `f` below):

from pyspark.sql import functions as f

df.groupBy(['name', 'id'])\
  .pivot('flag')\
  .agg(f.sum('cnt'))\
  .withColumnRenamed('true', 'flag_true')\
  .withColumnRenamed('false', 'flag_false')\
  .show()

That prints:

+----+-----+----------+---------+
|name|   id|flag_false|flag_true|
+----+-----+----------+---------+
|  li|19196|        15|       10|
+----+-----+----------+---------+
ernest_k