PySpark groupByKey returning pyspark.resultiterable.ResultIterable

Question

I am trying to figure out why my groupByKey is returning the following:

[(0, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a210>), (1, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a4d0>), (2, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a390>), (3, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a290>), (4, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a450>), (5, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a350>), (6, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a1d0>), (7, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a490>), (8, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a050>), (9, <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a650>)]

I have flatMapped values that look like this:

[(0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D'), (0, u'D')]

I'm doing just a simple:

groupRDD = columnRDD.groupByKey()

score 85 · Accepted Answer · answered Apr 18 '15 at 14:52


What you're getting back is an iterable object: groupByKey() wraps each key's values in a ResultIterable rather than a list. You can turn the results of groupByKey into a list by calling list() on the values, e.g.

example = sc.parallelize([(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')])

example.groupByKey().collect()
# Gives [(0, <pyspark.resultiterable.ResultIterable object ......]

example.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
# Gives [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]
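To see why the raw object prints as `<... object at 0x...>`, here is a minimal pure-Python sketch that runs without Spark. `GroupedValues` and `group_by_key` are hypothetical stand-ins written for illustration, not PySpark code; the assumption is that ResultIterable behaves like a thin iterable wrapper with no custom repr.

```python
from collections import defaultdict
from collections.abc import Iterable


class GroupedValues(Iterable):
    """Hypothetical stand-in for pyspark.resultiterable.ResultIterable."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

    def __len__(self):
        return len(self.data)


def group_by_key(pairs):
    """Local imitation of groupByKey: gather values per key, then wrap them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, GroupedValues(values)) for key, values in groups.items()]


pairs = [(0, "D"), (0, "D"), (1, "E"), (2, "F")]
grouped = group_by_key(pairs)

# The wrapper defines no __repr__, so printing shows "<... object at 0x...>".
print(grouped[0])

# list() walks the wrapper and yields the plain values.
materialized = [(key, list(values)) for key, values in grouped]
print(materialized)
```

The wrapper holds the values all along; list() only changes how they are presented.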

dpeacock

`example.groupByKey().mapValues(list).collect()` is shorter and also works – Charity Leschinski Jul 21 '15 at 19:29

How can I map through the `ResultIterable` type? – xxx222 Nov 28 '16 at 08:56
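On mapping through a ResultIterable: it can be consumed like any other Python iterable, so a function written against plain iterables can be passed to mapValues(). The sketch below is an illustration only; `describe` is a hypothetical helper, and a plain iterator stands in for the grouped values so the code runs without Spark.

```python
def describe(values):
    """Summarize one key's group; accepts any iterable of strings."""
    items = [v.lower() for v in values]  # the comprehension walks the group once
    return {"count": len(items), "distinct": sorted(set(items))}


# With Spark this would be applied per key (not executed here):
#   columnRDD.groupByKey().mapValues(describe)

group = iter(["D", "D", "E"])  # plain iterator standing in for a ResultIterable
summary = describe(group)
print(summary)
```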

score 31 · Answer 2 · edited May 03 '16 at 19:03

You can also use:

example.groupByKey().mapValues(list)

answered Jun 28 '15 at 23:15

Jayaram

score 1 · Answer 3 · answered Feb 17 '16 at 06:51

Instead of groupByKey(), I would suggest cogroup(), which groups the values of two RDDs by key. Refer to the example below.

[(k, tuple(map(list, v))) for k, v in sorted(x.cogroup(y).collect())]

Example:

>>> x = sc.parallelize([("foo", 1), ("bar", 4)])
>>> y = sc.parallelize([("foo", -1)])
>>> z = [(k, tuple(map(list, v))) for k, v in sorted(x.cogroup(y).collect())]
>>> print(z)

This should print [('bar', ([4], [])), ('foo', ([1], [-1]))].

score 1 · Answer 4 · edited Mar 20 '19 at 18:35

Example:

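from functools import reduce
from operator import add
r1 = sc.parallelize([('a',1),('b',2)])
r2 = sc.parallelize([('b',1),('d',2)])
# Concatenate each key's value lists from both RDDs into one tuple
r1.cogroup(r2).mapValues(lambda x: tuple(reduce(add, map(list, x)))).collect()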

Result (key order may vary):

[('d', (2,)), ('b', (2, 1)), ('a', (1,))]
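The per-key merging step can be tried without Spark. In this sketch, plain tuples stand in for the pair of iterables that cogroup() yields for each key (an assumption made so the snippet is self-contained); itertools.chain is shown as an alternative way to flatten them, not as something the answer above uses.

```python
from functools import reduce
from itertools import chain
from operator import add

# Stand-ins for cogroup's per-key value: one iterable per input RDD.
groups_for_b = ((2,), (1,))  # 'b' appears in both RDDs
groups_for_a = ((1,), ())    # 'a' appears only in the first RDD

# reduce(add, ...) concatenates the lists produced by map(list, ...).
merged_b = tuple(reduce(add, map(list, groups_for_b)))
merged_a = tuple(reduce(add, map(list, groups_for_a)))

# chain.from_iterable flattens the same groups without building lists.
chained_b = tuple(chain.from_iterable(groups_for_b))

print(merged_b, merged_a, chained_b)
```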

answered Dec 07 '17 at 07:52

bin yan

score 1 · Answer 5 · answered Jan 04 '19 at 01:38

In addition to the answers above, if you want a sorted list of unique items, use the following:

List of Distinct and Sorted Values

example.groupByKey().mapValues(set).mapValues(sorted)

Just List of Sorted Values

example.groupByKey().mapValues(sorted)

Alternatives to the above

# List of distinct sorted items
example.groupByKey().map(lambda x: (x[0], sorted(set(x[1]))))

# just sorted list of items
example.groupByKey().map(lambda x: (x[0], sorted(x[1])))

score 0 · Answer 6 · answered Mar 01 '19 at 20:32

Say your code is:

ex2 = ex1.groupByKey()

And then you run:

ex2.take(5)

You're going to see an iterable. That is fine if you're going to do something else with this data; you can just move on. But if all you want is to print or inspect the values before moving on, here is a bit of a hack:

ex2.toDF().show(20, False)

or just

ex2.toDF().show()

This will show the values of the data. Avoid collect() here, because it returns all the data to the driver, and with a large dataset that can exhaust the driver's memory. If ex2 = ex1.groupByKey() was your final step and you want those results returned, then use collect(), but make sure the data being returned is low volume.

print(ex2.collect())

Here is another nice post on using collect() on an RDD:

View RDD contents in Python Spark?
