Parallelization of large tasks in Python function

Question

I have a function that I want to parallelize in Python 3. Each call to df.myfunc(c1, c2) takes a long time to compute, so I would like to run the calls for all column pairs in parallel to speed things up on larger datasets.

def multi_thread_func(df):
    cols = df.schema.names
    length = len(cols)
    a = np.zeros((length * length))

    with multiprocessing.Pool() as pool:
        i = 0
        for value in pool.starmap(df.myfunc, itertools.product(cols, repeat=2)):
            a[i] = None if value is None else value
            i += 1
    return a

The specific error I am getting is:

TypeError: cannot pickle '_thread.lock' object

Are you using a Spark dataframe (please tag the df-framework)? If so, you might be on the wrong track. Look [here](https://stackoverflow.com/q/38048068/14311263) for example. — Timus, Mar 30 '23 at 07:38

score 1 · Accepted Answer · answered Mar 27 '23 at 23:49

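import itertools
import multiprocessing

import numpy as np

def multi_thread_func(df):
    length = len(df.cols)
    a = np.zeros(length * length)

    # calculate(i, j) is not shown here; it must be a module-level function
    # so that multiprocessing can pickle it by name. Only the integer
    # indices are sent to the workers, not df itself.
    with multiprocessing.Pool() as pool:
        pairs = itertools.product(range(length), repeat=2)
        for i, value in enumerate(pool.starmap(calculate, pairs)):
            a[i] = value

    return a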
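
A process pool has to pickle whatever it sends to its workers. If the object behind df cannot be pickled at all, a thread pool is one possible alternative, since threads share memory and nothing is pickled. The following is only a sketch under assumptions: pairwise_with_threads and demo_metric are hypothetical names, demo_metric stands in for df.myfunc, and threads will only give a speed-up if the real function spends its time outside the Python interpreter (for example waiting on an external engine or on I/O).

```python
import itertools
import math
from concurrent.futures import ThreadPoolExecutor


def pairwise_with_threads(func, cols, max_workers=4):
    """Call func(c1, c2) for every ordered pair of columns using threads.

    Returns a flat list in the same order as itertools.product(cols, repeat=2).
    """
    pairs = list(itertools.product(cols, repeat=2))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map keeps the input order, and nothing is pickled.
        values = list(executor.map(lambda pair: func(*pair), pairs))
    # Store missing results as NaN so the output stays numeric.
    return [math.nan if v is None else float(v) for v in values]


def demo_metric(c1, c2):
    # Hypothetical stand-in for df.myfunc; returns None on the diagonal.
    return None if c1 == c2 else len(c1) + len(c2)


flat = pairwise_with_threads(demo_metric, ["a", "bb", "ccc"])
print(flat)
```

In the question's setting the call would look like pairwise_with_threads(df.myfunc, cols), and the flat list can be passed to np.array if an array is needed.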

Frank Yellin

The above is returning the error: TypeError: cannot pickle '_thread.lock' object. – Olivia Mar 29 '23 at 20:17
Don't know. It runs just fine for me. What is it trying to pickle? – Frank Yellin Mar 30 '23 at 05:33
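
On the question of what is being pickled: a bound method such as df.myfunc carries its instance with it, so handing it to pool.starmap means pickling df as well. The sketch below reproduces that effect without any DataFrame library. LockHoldingFrame and try_pickle are hypothetical names, and the lock attribute is an assumption about what the real df holds internally, not a statement about any particular framework.

```python
import pickle
import threading


class LockHoldingFrame:
    """Hypothetical stand-in for a DataFrame-like object; not a real API."""

    def __init__(self):
        # Assumption: the real df keeps a lock like this somewhere inside.
        self._lock = threading.Lock()

    def myfunc(self, c1, c2):
        return len(c1) + len(c2)


def try_pickle(obj):
    """Return None if obj can be pickled, otherwise the exception raised."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as exc:  # a lock raises TypeError; catch other failures too
        return exc


df = LockHoldingFrame()

# A bound method carries its instance, so pickling it pickles df as well.
method_error = try_pickle(df.myfunc)

# Plain column names, or integer indices as in the answer, pickle fine.
args_error = try_pickle(("col_a", "col_b"))

print("bound method:", method_error)
print("plain args:", args_error)
```

Running the same kind of check on the function and arguments actually passed to starmap is one way to find which object drags the lock along.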