
I have a Spark RDD (myData) that has been mapped as a list. The output of myData.collect() yields the following:

['x', 'y', 'z']

What operation can I perform on myData to create a new RDD containing a list of all permutations of xyz? For example, newData.collect() would output:

['xyz', 'xzy', 'zxy', 'zyx', 'yxz', 'yzx']

I've tried using variations of cartesian(myData), but as far as I can tell, the best that gives me is different combinations of two-value pairs.

HSskillet

2 Answers


Doing this all in PySpark: you can use rdd.cartesian, but you have to filter out repeats and do it twice (not saying this is good!):

 >>> rdd1 = rdd.cartesian(rdd).filter(lambda x: x[1] not in x[0]).map(lambda x: ''.join(x))
 >>> rdd1.collect()
 ['xy', 'xz', 'yx', 'yz', 'zx', 'zy']
 >>> rdd2 = rdd1.cartesian(rdd).filter(lambda x: x[1] not in x[0]).map(lambda x: ''.join(x))
 >>> rdd2.collect()
 ['xyz', 'xzy', 'yxz', 'yzx', 'zxy', 'zyx']
AChampion
  • This works for the given data set, but it appears it won't scale very well as I'm sure you're aware. Was looking for something that would still apply if I had say 10 unique letters. Thanks for the help though! – HSskillet Apr 30 '17 at 17:04
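The comment above raises the scaling question: with n unique letters, the cartesian-and-filter step simply needs to be repeated n − 1 times. A plain-Python sketch of that loop (the helper name is hypothetical; in Spark each pass would be an `rdd.cartesian(...).filter(...).map(...)` as in the answer):

```python
from itertools import product

def permutations_by_pairing(letters):
    """Build all permutations of unique letters by repeatedly pairing
    prefixes with letters and keeping only pairs with no repeats —
    a plain-Python emulation of the repeated cartesian/filter passes."""
    results = list(letters)
    # Each pass extends every prefix by one letter not already used in it.
    for _ in range(len(letters) - 1):
        results = [a + b for a, b in product(results, letters) if b not in a]
    return results
```

Note this still enumerates all n! strings, so for 10 letters that is 3,628,800 results; the benefit of doing it in Spark is only that each pass is distributed.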
>>> from itertools import permutations
>>> t = ['x', 'y', 'z']
>>> ["".join(item) for item in permutations(t)]

['xyz', 'xzy', 'yxz', 'yzx', 'zxy', 'zyx']

Note: an RDD can be converted to a local iterator using toLocalIterator()
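Putting the pieces together, a sketch of the round trip (the Spark calls are shown only in comments, since they assume an active SparkContext; a plain list stands in for the RDD contents here):

```python
from itertools import permutations

# With Spark you would pull the elements to the driver lazily via:
#   local_elems = myData.toLocalIterator()
# (or eagerly via myData.collect()). A plain list stands in here:
local_elems = ['x', 'y', 'z']

# permutations() yields tuples; join each into a string.
perms = ["".join(p) for p in permutations(local_elems)]

# Back in Spark, re-distribute the result:
#   newData = sc.parallelize(perms)
```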

JkShaw
  • You don't need that `list` call, you can iterate directly over the iterable returned by `permutations`. – PM 2Ring Apr 30 '17 at 05:19
  • Hmm, this works outside of the scope of the RDD, but if I try it within the RDD pipeline I get the error "'TypeError: 'PipelinedRDD' object is not iterable". Example tries: test1 = originalRDD.map(lambda a: ["".join(item) for item in permutations(a)]) test1 = originalRDD.map(["".join(item) for item in permutations(originalRDD)) I suppose I might need to settle for collecting first.. – HSskillet Apr 30 '17 at 05:30
  • @HSskillet I don't know Spark, but that should work if you convert the RDD object that you're passing to `permutations` to a list or tuple. Also, you might be able to pass RDD.map a generator expression instead of a list comprehension. – PM 2Ring Apr 30 '17 at 05:45
  • @HSskillet, try converting your `RDD object` to `iterables` using `toLocalIterator()` – JkShaw Apr 30 '17 at 06:31
  • @JkShaw Best solution yet, although I suppose there isn't too much of a difference between this and collect(), but the use of an iterator appears more efficient than making a whole new list. Regardless, I am just parallelizing it again afterward. Thanks for the help! – HSskillet Apr 30 '17 at 17:22