I have a problem restructuring data using Spark. The original data looks like this:

df = sqlContext.createDataFrame([
    ("ID_1", "VAR_1", "Butter"),
    ("ID_1", "VAR_2", "Toast"),
    ("ID_1", "VAR_3", "Ham"),
    ("ID_2", "VAR_1", "Jam"),
    ("ID_2", "VAR_2", "Toast"),
    ("ID_2", "VAR_3", "Egg"),
], ["ID", "VAR", "VAL"])

>>> df.show()
+----+-----+------+
|  ID|  VAR|   VAL|
+----+-----+------+
|ID_1|VAR_1|Butter|
|ID_1|VAR_2| Toast|
|ID_1|VAR_3|   Ham|
|ID_2|VAR_1|   Jam|
|ID_2|VAR_2| Toast|
|ID_2|VAR_3|   Egg|
+----+-----+------+

This is the structure I am trying to achieve:

+----+------+-----+-----+
|  ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast|  Ham|
|ID_2|   Jam|Toast|  Egg|
+----+------+-----+-----+

My idea was to use:

df.groupBy("ID").pivot("VAR").show()

But I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'

Any suggestions? Thanks!

1 Answer

You need to add an aggregation after `pivot()`. If you are sure there is only one `VAL` for each `(ID, VAR)` pair, you can use `first()`:

from pyspark.sql import functions as f

result = df.groupBy("ID").pivot("VAR").agg(f.first("VAL"))
result.show()

+----+------+-----+-----+
|  ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast|  Ham|
|ID_2|   Jam|Toast|  Egg|
+----+------+-----+-----+
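
As an aside, if you already know the distinct values of `VAR`, you can pass them to `pivot()` explicitly; Spark then skips the extra job it would otherwise run to compute them. A minimal sketch based on the example data above:

from pyspark.sql import functions as f

# Listing the pivot values up front avoids a separate pass over
# the data to determine the distinct values of "VAR".
result = df.groupBy("ID").pivot("VAR", ["VAR_1", "VAR_2", "VAR_3"]).agg(f.first("VAL"))
result.show()
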
  • As `first` is an "action", it could run slowly. Maybe another possible solution would be: `df.groupBy("ID").pivot("VAR").agg(concat_ws('', collect_list(col("VAL"))))`. This should run faster. – Lucas Mendes Mota Da Fonseca Jun 03 '19 at 20:15
  • @LucasMendesMotaDaFonseca, first is not an action in this scenario, only when used directly on an RDD or DF. – Daniel de Paula Jun 05 '19 at 15:32
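
For reference, here is the `collect_list` variant from the comment above as a runnable sketch; when there is exactly one `VAL` per `(ID, VAR)` pair, it produces the same table as the `first()` version:

from pyspark.sql import functions as f

# collect_list gathers all values for each (ID, VAR) cell, and
# concat_ws("", ...) joins them into a single string; with one
# value per pair this is just that value.
result = (
    df.groupBy("ID")
      .pivot("VAR")
      .agg(f.concat_ws("", f.collect_list(f.col("VAL"))))
)
result.show()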