
I have the following for loop:

for j in range(len(list_list_int)):
    arr_1_, arr_2_, arr_3_ = foo(bar, list_list_int[j])
    arr_1[j,:] = arr_1_.data.numpy()
    arr_2[j,:] = arr_2_.data.numpy()
    arr_3[j,:] = arr_3_.data.numpy()

I would like to apply foo with multiprocessing, mainly because it is taking a lot of time to finish. I tried to do it in batches with funcy's chunks method:

for j in chunks(1000, list_list_int):
    arr_1_, arr_2_, arr_3_ = foo(bar, list_list_int[j])
    arr_1[j,:] = arr_1_.data.numpy()
    arr_2[j,:] = arr_2_.data.numpy()
    arr_3[j,:] = arr_3_.data.numpy()

However, I am getting `'list' object cannot be interpreted as an integer`. What is the correct way of applying foo using multiprocessing?

anon
    According to the docs and my own tests, the way you are calling it _should_ work. Not sure why it doesn't, but you can try explicitly specifying a step (if you want default behaviour the step should have the same value as the first argument). – bgfvdu3w May 15 '19 at 05:12
  • Is there any other alternative for applying the function? @Mark – anon May 15 '19 at 05:13
  • In `for j in chunks(1000, list_list_int):`, `j` is not an integer; it is a sublist of `list_list_int`, so you need to iterate over `j` again. https://i.stack.imgur.com/JXDmZ.png – Wang Liang May 15 '19 at 05:15
  • Thanks for the help @KingStone, could you show an example? – anon May 15 '19 at 05:16
  • I updated my comment with a code screenshot. But chunking alone cannot increase speed. How about https://stackoverflow.com/questions/11515944? – Wang Liang May 15 '19 at 05:17
  • I am getting a TypeError: `list indices must be integers or slices, not list` @KingStone – anon May 15 '19 at 05:21
  • Could you provide an example of how to use multiprocessing for this case? @KingStone – anon May 15 '19 at 05:22
  • What's the purpose of these lines inside the for loop? `arr_1[j,:] = arr_1_.data.numpy()`, they don't do anything (the arr_1 variable is overwritten in the next iteration) – Andy Hayden May 22 '19 at 04:59

3 Answers

from funcy import chunks

list_list_int = [1, 2, 3, 4, 5, 6]
for j in chunks(2, list_list_int):  # j is a sublist, e.g. [1, 2]
    for i in j:  # iterate over the sublist to get individual integers
        avg_, max_, last_ = foo(bar, i)
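
To actually run foo with multiprocessing, as the question asks, here is a minimal sketch using the standard library's multiprocessing.Pool. The foo and bar below are stand-ins for the question's (the real ones must be picklable and defined at module level):

from functools import partial
from multiprocessing import Pool

def foo(bar, i):
    # Stand-in for the question's foo; returns three results per input
    return bar * i, bar + i, i

bar = 2
list_list_int = [1, 2, 3, 4, 5, 6]

if __name__ == "__main__":
    # Bind the constant first argument so the pool maps only over the varying one
    work = partial(foo, bar)
    with Pool(processes=4) as pool:
        # imap preserves input order; chunksize batches the work much like funcy's chunks
        for j, (avg_, max_, last_) in enumerate(pool.imap(work, list_list_int, chunksize=2)):
            print(j, avg_, max_, last_)

With a Pool there is no need for funcy at all: imap's own chunksize argument does the batching.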
Wang Liang

I don't have funcy installed, but from the docs I suspect that, for size-2 chunks of:

alist = [[1,2],[3,4],[5,6],[7,8]]

chunks produces:

j = [[1,2],[3,4]]
j = [[5,6],[7,8]]

Using such a j as a list index produces an error:

In [116]: alist[j]                                                              
TypeError: list indices must be integers or slices, not list

And if your foo can't work with the full list of lists, I don't see how it will work with that list split into chunks. Apparently it can only work with one sublist at a time.
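
In other words, the loop has to hand foo one sublist at a time while keeping an integer row index for the output arrays. A minimal sketch, assuming (as in the question) that foo returns three tensor-like objects exposing .data.numpy(), with out_width standing in for the unknown output width:

import numpy as np

n_rows = len(list_list_int)
arr_1 = np.empty((n_rows, out_width))
arr_2 = np.empty((n_rows, out_width))
arr_3 = np.empty((n_rows, out_width))

# enumerate supplies the integer row index j; sublist is the actual argument for foo
for j, sublist in enumerate(list_list_int):
    arr_1_, arr_2_, arr_3_ = foo(bar, sublist)
    arr_1[j, :] = arr_1_.data.numpy()
    arr_2[j, :] = arr_2_.data.numpy()
    arr_3[j, :] = arr_3_.data.numpy()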

hpaulj

If you are looking to perform parallel operations on a numpy array, then I would use Dask.

With just a few lines of code, your operation can be run on multiple processes, and the well-developed Dask scheduler will balance the load for you. A huge benefit of Dask compared to other parallel libraries like joblib is that it maintains the native numpy API.

import dask.array as da

# Setting up a random array with dimensions 10K rows and 10 columns
# This data is stored distributed across 10 chunks, and the columns are kept together (1_000, 10)
x = da.random.random((10_000, 10), chunks=(1_000, 10))
x = x.persist()  # Allow the entire array to persist in memory to speed up calculation


def foo(x):
    return x / 10


# Dask's version of numpy's apply_along_axis: apply foo to each row of the matrix in parallel
result_foo = da.apply_along_axis(foo, 1, x)

# View original contents
x[0:10].compute()

# View sample of results
result_foo = result_foo.compute()
result_foo[0:10]
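
Note that Dask arrays use a thread-based scheduler by default; if foo is pure-Python code that holds the GIL, you can ask compute for a local process pool instead. A small, assumed variation on the code above:

# Use a local process pool instead of the default threaded scheduler
result_foo = da.apply_along_axis(foo, 1, x).compute(scheduler="processes")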
Chris Farr