
So, I have a 3-dimensional list. For example:

A=[[[1,2,3],[4,5,6],[7,8,9]],...,[[2,4,1],[1,4,6],[1,2,4]]]

I want to process each 2-dimensional list in A independently, but they all go through the same process. If I do it sequentially, I write:

for i in range(len(A)):
    A[i]=process(A[i])

But it takes a very long time. Could you tell me how to parallelize this with data parallelism in Python?

fahadh4ilyas
  • Parallelization requires threading. That requires a lot of work. Since lists are mutable, you'll likely have to create a copy for each individual thread (if I'm smart at programming). Copying/slicing the list will likely take more time than processing it in a single thread. – Zizouz212 Oct 27 '16 at 01:43
  • List of options here... https://wiki.python.org/moin/ParallelProcessing – OneCricketeer Oct 27 '16 at 01:45
  • @Zizouz212 so, there's no way that's more efficient than processing it sequentially? – fahadh4ilyas Oct 27 '16 at 01:56
  • Threading in Python is actually not parallel at all and would slow things down, because the Global Interpreter Lock only executes one Python instruction at a time regardless of the number of threads. It's only suitable for doing something while waiting for I/O on another thread. True multiprocessing would work, but sharing the data requires locking and would also quite likely slow things down if there's a lot of writing to the array. – Jim Stewart Oct 27 '16 at 01:56
  • @cricket_007 Thanks, I'll try it. – fahadh4ilyas Oct 27 '16 at 01:56
  • @Jim Stewart if I made that list into part by myself like B=A[0:10], C=A[10:20], etc, could you tell me how to process B, C, etc at the same time? – fahadh4ilyas Oct 27 '16 at 02:00
  • @user7077941 My first choice would be to try to implement `process` with numpy. – Francisco Oct 27 '16 at 02:02
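As the last comment suggests, if `process` can be expressed as elementwise array operations, vectorizing with numpy may beat any parallel scheme. A minimal sketch, assuming (purely for illustration) that `process` squares every element:

```python
import numpy as np

A = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[2, 4, 1], [1, 4, 6], [1, 2, 4]]]

# Convert the whole 3-D list into one array and apply the
# operation to every element at once, with no Python-level loop.
arr = np.asarray(A)
squared = arr ** 2
```

Whether this applies depends entirely on what `process` actually does; it only works when the operation maps onto numpy's array functions.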

1 Answer


If you have multiple cores and processing each 2-dimensional list is an expensive operation, you could use Pool from multiprocessing. Here's a short example that squares the numbers in different processes:

import multiprocessing as mp

A = [[[1,2,3],[4,5,6],[7,8,9]],[[2,4,1],[1,4,6],[1,2,4]]]

def square(l):
    # square every element of one 2-dimensional list
    return [[x * x for x in sub] for sub in l]

pool = mp.Pool(processes=mp.cpu_count())
res = pool.map(square, A)  # each 2-D list is sent to a worker process
pool.close()
pool.join()

print res

Output:

[[[1, 4, 9], [16, 25, 36], [49, 64, 81]], [[4, 16, 1], [1, 16, 36], [1, 4, 16]]]

Pool.map behaves like the built-in map while splitting the iterable across the worker processes. It also has a third parameter called chunksize that defines how large the chunks submitted to the workers are.
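For instance, a sketch of passing chunksize (the batch size of 10 and the 100-element input here are arbitrary choices for illustration, not recommendations):

```python
import multiprocessing as mp

def square(l):
    # square every element of one 2-dimensional list
    return [[x * x for x in sub] for sub in l]

# 100 small 2-dimensional lists
A = [[[i, i + 1], [i + 2, i + 3]] for i in range(100)]

pool = mp.Pool(processes=mp.cpu_count())
# chunksize=10: each worker receives 10 sublists per task,
# which reduces inter-process communication overhead
res = pool.map(square, A, 10)
pool.close()
pool.join()
```

Note that on Windows this would need to live under an `if __name__ == '__main__':` guard, since multiprocessing re-imports the module in each worker.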

niemmi
  • Thank you very much! I'll try it later because my program has been running sequentially for 30 minutes now. So, I don't have to split `A` into parts; python will split it? – fahadh4ilyas Oct 27 '16 at 02:20
  • @user7077941 No, you don't have to split `A`. You might want to pass the third parameter `chunksize` though, depending on your data. – niemmi Oct 27 '16 at 02:30
  • Wow, such a simple way. Thank you. ^^ – fahadh4ilyas Oct 27 '16 at 02:32
  • Hi, I'm confused. I tried using your code but somehow error appeared "AttributeError" in with mp.Pool(processes=mp.cpu_count()) as pool. What should I do? – fahadh4ilyas Oct 27 '16 at 03:16
  • @user7077941 Didn't notice python 2.7 tag so I was running it with 3.5, I've edited the answer to work on Python 2.7. – niemmi Oct 27 '16 at 03:20
  • Can I use lambda function? Or I must use named function? – fahadh4ilyas Oct 27 '16 at 03:25
  • @user7077941 You have to use named function since [lambdas can't be pickled](http://stackoverflow.com/questions/16626429/python-cpickle-pickling-lambda-functions). – niemmi Oct 27 '16 at 03:30
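To illustrate the workaround from the comment above: keep the function at module level, and if a lambda was only there to bind an extra argument, `functools.partial` wrapped around a named function can be pickled. A minimal sketch (the `scale` function and the factor of 10 are made up for illustration):

```python
import multiprocessing as mp
from functools import partial

def scale(factor, l):
    # multiply every element of one 2-dimensional list by factor
    return [[x * factor for x in sub] for sub in l]

A = [[[1, 2], [3, 4]]]

pool = mp.Pool(2)
# partial objects wrapping a module-level function can be
# pickled, unlike lambdas
res = pool.map(partial(scale, 10), A)
pool.close()
pool.join()
```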