My RDD in PySpark consists of lists with four elements each:

[id1, 'aaa',12,87]
[id2, 'acx',1,90]
[id3, 'bbb',77,10]
[id2, 'bbb',77,10]
.....

I want to group by the id in the first column and collect the other three columns for each id, for example: [id2, [['acx',1,90], ['bbb',77,10], ...]]. How can I do this?

yanachen

1 Answer

spark.version
# u'2.2.0'

rdd = sc.parallelize((['id1', 'aaa',12,87],
                      ['id2', 'acx',1,90],
                      ['id3', 'bbb',77,10],
                      ['id2', 'bbb',77,10]))

rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).collect()

# result:

[('id2', [['acx', 1, 90], ['bbb', 77, 10]]), 
 ('id3', [['bbb', 77, 10]]), 
 ('id1', [['aaa', 12, 87]])]

Or, if you strictly prefer lists over tuples, you can add one more map operation after mapValues:

rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).map(list).collect()

# result:

[['id2', [['acx', 1, 90], ['bbb', 77, 10]]], 
 ['id3', [['bbb', 77, 10]]],
 ['id1', [['aaa', 12, 87]]]]
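
For intuition, here is a rough sketch of what the map / groupByKey / mapValues(list) chain does, written in plain Python without Spark. The names `rows` and `group_rows` are made up for this illustration and are not part of any Spark API; the data is the same four rows used above.

```python
from collections import defaultdict

# Same sample data as the RDD above, held in a plain local list.
rows = [['id1', 'aaa', 12, 87],
        ['id2', 'acx', 1, 90],
        ['id3', 'bbb', 77, 10],
        ['id2', 'bbb', 77, 10]]


def group_rows(rows):
    """Group rows by their first element, keeping the remaining elements."""
    grouped = defaultdict(list)
    for row in rows:
        # Same split as map(lambda x: (x[0], x[1:])): key first, rest as value.
        key, rest = row[0], row[1:]
        # Same effect as groupByKey().mapValues(list): one list of values per key.
        grouped[key].append(rest)
    # Return [key, values] lists, mirroring the "lists strictly" variant.
    return [[key, values] for key, values in grouped.items()]


result = group_rows(rows)
print(result)
```

This local version returns keys in first-seen order, whereas the key order in the Spark results above depends on how the data is partitioned and should not be relied on.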
desertnaut