
What I would like to do is compute each list separately. For example, if I have 5 lists ([1,2,3,4,5,6], [2,3,4,5,6,7], [3,4,5,6,7,8], [4,5,6,7,8,9], [5,6,7,8,9,10]) and I would like to get the 5 lists without the 6, I would do something like:

data=[1,2,3,4,5,6]+[2,3,4,5,6,7]+[3,4,5,6,7,8]+[4,5,6,7,8,9]+[5,6,7,8,9,10]

def function_1(iter_listoflist):
    final_iterator=[]
    for sublist in iter_listoflist:
        final_iterator.append([x for x in sublist if x!=6])
    return iter(final_iterator)  

sc.parallelize(data,5).glom().mapPartitions(function_1).collect()

and then cut the result back into lists so I get the original lists again. Is there a way to simply separate the computation? I don't want the lists to mix, and they might be of different sizes.

thank you

Philippe

no, not always the last element. I just want to compute the lists in parallel with each other. The input to parallelize is whatever works, and this worked for lists of the same size. If there is a way to not use parallelize and give the partitions directly, that would be great. I just want it to compute the different lists separately from each other and give me the different results, which are also lists – Philippe C Nov 06 '15 at 08:49

1 Answer


As far as I understand your intentions, all you need here is to keep the individual lists separate when you parallelize your data:

data = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [3, 4, 5, 6, 7, 8],
        [4, 5, 6, 7, 8, 9], [5, 6, 7, 8, 9, 10]]

rdd = sc.parallelize(data)

rdd.take(1) # A single element of an RDD is a whole list
## [[1, 2, 3, 4, 5, 6]]

Now you can simply map using a function of your choice:

def drop_six(xs):
    return [x for x in xs if x != 6]

rdd.map(drop_six).take(3)
## [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7], [3, 4, 5, 7, 8]]
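If you do want `mapPartitions`, the function has to consume an iterator over the partition's elements (each element here is a whole list) and return an iterator of results. A minimal sketch of that shape, which you can check on a plain Python iterator without a SparkContext:

```python
def drop_six(xs):
    return [x for x in xs if x != 6]

# mapPartitions expects: iterator of elements -> iterator of results.
# Here each element of the partition is itself a list, so we filter
# inside each list and yield the filtered lists.
def drop_six_partition(iterator):
    return (drop_six(xs) for xs in iterator)

# In Spark you would call: rdd.mapPartitions(drop_six_partition)
# The same logic checked on a plain iterator:
partition = iter([[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7]])
print(list(drop_six_partition(partition)))
# [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7]]
```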
  • I made a mistake and gave the wrong data. Thank you for everything, you've been of great help! – Philippe C Nov 06 '15 at 10:47
  • I am glad I could help. – zero323 Nov 06 '15 at 10:47
  • actually I just tried it with mapPartitions instead of map and it didn't work! It gave me the same list as the input. I think there is something I don't understand even after http://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions and the documentation – Philippe C Nov 06 '15 at 11:04
  • does map compute in parallel because of the parallelize call? – Philippe C Nov 06 '15 at 11:07
  • And it shouldn't work. Moreover there is no reason to use `mapPartitions` here. `map` over RDD works in parallel and simplifying things __a lot__ you can think that it is a result of `parallelize`. – zero323 Nov 06 '15 at 11:12