
I have a Spark RDD (myData) that has been mapped as a list. The output of myData.collect() yields the following:

['x', 'y', 'z']

What operation can I perform on myData to create a new RDD containing a list of all permutations of xyz? For example, newData.collect() would output:

['xyz', 'xzy', 'zxy', 'zyx', 'yxz', 'yzx']

I've tried using variations of cartesian(myData), but as far as I can tell, the best that gives me is different combinations of two-value pairs.

HSskillet

2 Answers


Doing this all in PySpark: you can use rdd.cartesian, but you have to filter out repeats and do it twice (not saying this is good!):

 >>> rdd1 = rdd.cartesian(rdd).filter(lambda x: x[1] not in x[0]).map(lambda x: ''.join(x))
 >>> rdd1.collect()
 ['xy', 'xz', 'yx', 'yz', 'zx', 'zy']
 >>> rdd2 = rdd1.cartesian(rdd).filter(lambda x: x[1] not in x[0]).map(lambda x: ''.join(x))
 >>> rdd2.collect()
 ['xyz', 'xzy', 'yxz', 'yzx', 'zxy', 'zyx']
AChampion
  • This works for the given data set, but it appears it won't scale very well as I'm sure you're aware. Was looking for something that would still apply if I had say 10 unique letters. Thanks for the help though! – HSskillet Apr 30 '17 at 17:04
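The comment above raises the scaling question: with n unique letters, the cartesian-and-filter step simply needs to be repeated n − 1 times. A plain-Python sketch of that loop (the helper name is hypothetical; in Spark each pass would be an `rdd.cartesian(...).filter(...).map(...)` as in the answer):

```python
from itertools import product

def permutations_by_pairing(letters):
    """Build all permutations of unique letters by repeatedly pairing
    prefixes with letters and keeping only pairs with no repeats —
    a plain-Python emulation of the repeated cartesian/filter passes."""
    results = list(letters)
    # Each pass extends every prefix by one letter not already used in it.
    for _ in range(len(letters) - 1):
        results = [a + b for a, b in product(results, letters) if b not in a]
    return results
```

Note this still enumerates all n! strings, so for 10 letters that is 3,628,800 results; the benefit of doing it in Spark is only that each pass is distributed.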
>>> from itertools import permutations
>>> t = ['x', 'y', 'z']
>>> ["".join(item) for item in permutations(t)]

['xyz', 'xzy', 'yxz', 'yzx', 'zxy', 'zyx']

Note: an RDD can be converted to a local iterator using toLocalIterator()
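Putting the pieces together, a sketch of the round trip (the Spark calls are shown only in comments, since they assume an active SparkContext; a plain list stands in for the RDD contents here):

```python
from itertools import permutations

# With Spark you would pull the elements to the driver lazily via:
#   local_elems = myData.toLocalIterator()
# (or eagerly via myData.collect()). A plain list stands in here:
local_elems = ['x', 'y', 'z']

# permutations() yields tuples; join each into a string.
perms = ["".join(p) for p in permutations(local_elems)]

# Back in Spark, re-distribute the result:
#   newData = sc.parallelize(perms)
```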

JkShaw
  • You don't need that `list` call, you can iterate directly over the iterable returned by `permutations`. – PM 2Ring Apr 30 '17 at 05:19
  • Hmm, this works outside of the scope of the RDD, but if I try it within the RDD pipeline I get the error "'TypeError: 'PipelinedRDD' object is not iterable". Example tries: test1 = originalRDD.map(lambda a: ["".join(item) for item in permutations(a)]) test1 = originalRDD.map(["".join(item) for item in permutations(originalRDD)) I suppose I might need to settle for collecting first.. – HSskillet Apr 30 '17 at 05:30
  • @HSskillet I don't know Spark, but that should work if you convert the RDD object that you're passing to `permutations` to a list or tuple. Also, you might be able to pass RDD.map a generator expression instead of a list comprehension. – PM 2Ring Apr 30 '17 at 05:45
  • @HSskillet, try converting your `RDD object` to `iterables` using `toLocalIterator()` – JkShaw Apr 30 '17 at 06:31
  • @JkShaw Best solution yet, although I suppose there isn't too much of a difference between this and collect(), but the use of an iterator appears more efficient than making a whole new list. Regardless, I am just parallelizing it again afterward. Thanks for the help! – HSskillet Apr 30 '17 at 17:22