
I have a nested loop and want to check the sum of the outer loop's values against each of the inner loop's values. I am getting the desired result, but it takes a couple of hours. Is there any way I can reduce the time?

I am using df.iterrows() to iterate through all the rows. df1 has 1 million rows and df2 has 1000.

It would be really helpful if the time could be reduced to 5-10 minutes or even less, because the same work needs to be repeated on a daily basis.

This is what the dataframes look like:

df1......
       col1      col2  NEWVALUE
0  0.727900  0.007912       NaN
1  0.249418  0.087288       NaN
2  0.592969  0.443518       NaN
3  0.832903  0.101647       NaN
4  0.129666  0.321423       NaN
df2...
       col1      col2  OLDVALUE
0  0.176620  0.857886        43
1  0.758241  0.086826       609
2  0.855264  0.959226       388
3  0.929884  0.349760       137
4  0.693689  0.375171         0

Here is the code:

list_values = []
for idx, xitems in df1.iterrows():
    savVal = -1
    i = 99
    for idy, yitems in df2.iterrows():
        value = xitems['col1'] + xitems['col2'] + yitems['col1'] + yitems['col2']
        # it only runs the first time, to store the value into savVal
        if savVal == -1:
            savVal = value
        else:
            if value <= 1 and value < savVal:
                savVal = value
                i = idy
                break
    if i == 99:
        #df1.iat[idx, 'NEWVALUE'] = 'LESSTHAN'
        # in case the above code throws an error, the alternative is a list
        list_values.append('LESSTHAN')
    else:
        #df1.iat[idx, 'NEWVALUE'] = df2.loc[i, 'OLDVALUE']
        list_values.append(df2.loc[i, 'OLDVALUE'])

Rajat Gupta
  • @RomanPerekhrest done the formatting – Rajat Gupta Aug 28 '19 at 10:20
  • 1
    You had some tabs mixed with 4-space indents. NEVER, and I'll say it again, NEVER mix them. – Adirio Aug 28 '19 at 10:24
  • @RajatGupta, what's the size of each dataframe? – RomanPerekhrest Aug 28 '19 at 10:25
  • @RomanPerekhrest Sorry, I forgot to mention, its 1 million for df1 and 1000 for df2 – Rajat Gupta Aug 28 '19 at 10:27
  • @RajatGupta, looks like `savDist` and `dist` are not used anywhere – RomanPerekhrest Aug 28 '19 at 10:30
  • @RomanPerekhrest my bad, I have corrected it – Rajat Gupta Aug 28 '19 at 10:36
  • Would you mind to post headers and a few lines from either of the dataframes and what is your **target run-time that you expect** to gain ( expressed in [s] )? Btw. threads do not help here, as central GIL-lock stepping re-[SERIAL]-ises all the processing back to pure-[SERIAL] sequence of code-executions "blocks", executed only after a thread has acquired the central GIL-lock... – user3666197 Aug 28 '19 at 11:30
  • Based on your code I think it's possible to do it using matrix. I believe there should be a way to construct a 1mm * 1000 matrix by summing up the number you provide using numpy. – clide Aug 28 '19 at 11:46
  • You could use [multiprocessing package for the outer for loop](https://stackoverflow.com/a/40428356/2672299). Please also note that [iterrows is generally not recommended with pandas](https://stackoverflow.com/a/24871316/2672299). – user2672299 Aug 28 '19 at 11:59
  • currently `df1.iat...` throws an error `ValueError: iAt based indexing can only have integer indexers` – RomanPerekhrest Aug 28 '19 at 12:02
  • @RomanPerekhrest I have replaced it with list – Rajat Gupta Aug 28 '19 at 12:09
  • @RajatGupta are you able to summarise in words what you are trying to do as it is hard to tell from your code? Your `list_values` is somehow the first value in each iteration less `<= 1` and less than the 0th iteration of df2? – tomjn Aug 28 '19 at 12:16

1 Answer


As mentioned in the comments, you should try to avoid iterrows and think about this in terms of a matrix problem. My first step would be to compute the sum of "col1" and "col2" for each DataFrame separately:

df1["sum_col"] = df1["col1"] + df1["col2"]
df2["sum_col"] = df2["col1"] + df2["col2"]

These can then be added together with a bit of numpy magic to get all possible sums of the two numbers:

import numpy as np

all_values = (df1["sum_col"].values[np.newaxis].T +
              df2["sum_col"].values[np.newaxis])

all_values will now have shape (1000000, 1000), which is every possible sum of the two columns.
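
To see what that broadcasting does, here is a toy example (made-up numbers, just to illustrate the shape arithmetic):

x = np.array([1.0, 2.0, 3.0])   # stand-in for df1["sum_col"].values
y = np.array([10.0, 20.0])      # stand-in for df2["sum_col"].values
# (3, 1) + (1, 2) broadcasts to (3, 2): every pairwise sum
pairwise = x[np.newaxis].T + y[np.newaxis]
print(pairwise)
# [[11. 21.]
#  [12. 22.]
#  [13. 23.]]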

Now, the next part is where I'm not so clear on what you are trying to do... so correct me if I'm wrong. It looks to me like you are setting savVal to the first value of each pass over df2(?). In that case it should have a shape of 1000000, so we can do

sav_val = all_values[:, 0]

Then we want to find the first(?) value of your inner loop that is less than or equal to 1 and less than sav_val. Let's check whether each condition is met separately

less_than_one = np.less_equal(all_values, 1)

and

less_than_sav_val = np.less(all_values.T, sav_val).T

The .Ts are transposes, which help us broadcast to the right shape.
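
Incidentally, the same comparison can be spelled without the transposes by giving sav_val a trailing axis so that it broadcasts across the columns (equivalent, shown only for reference):

# (1000000, 1) broadcasts against (1000000, 1000) column by column
less_than_sav_val = np.less(all_values, sav_val[:, np.newaxis])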

We can combine our two conditions and find the first True value in each row using argmax (see this question). If a row has no True value, argmax returns the first entry in that row (index 0)

passes_condition = less_than_one & less_than_sav_val
result = df2['OLDVALUE'].values.take(passes_condition.argmax(axis=1))
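
A quick toy check of the argmax trick (my own illustration): on a boolean array, argmax returns the index of the first True in each row, and 0 for an all-False row, which is exactly why the fix-up step below is needed.

demo = np.array([[False, True, True],
                 [False, False, False]])
print(demo.argmax(axis=1))   # [1 0] -- first True is at index 1; all-False gives 0
print(demo.any(axis=1))      # [ True False] -- tells the two cases apart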

OK, almost there. result has shape (1000000,). We can now fix up the rows that had no entry with a value <= 1 and less than the first iteration. We'll set them to -999 for now.

result[~passes_condition.any(axis=1)] = -999


Putting it all together

def rajat_func(df1, df2):
    list_values = []
    for idx, xitems in df1.iterrows():
        savVal = -1
        i = 99
        for idy, yitems in df2.iterrows():
            value = xitems['col1'] + xitems['col2'] + yitems['col1'] + yitems['col2']
            # it only runs the first time, to store the value into savVal
            if savVal == -1:
                savVal = value
            else:
                if value <= 1 and value < savVal:
                    savVal = value
                    i = idy
                    break
        if i == 99:
            #df1.iat[idx, 'NEWVALUE'] = 'LESSTHAN'
            # in case the above code throws an error, the alternative is a list
            list_values.append(-999)
        else:
            #df1.iat[idx, 'NEWVALUE'] = df2.loc[i, 'OLDVALUE']
            list_values.append(df2.loc[i, 'OLDVALUE'])
    return list_values

import numpy as np

def new_func(df1, df2):
    x = (df1["col1"] + df1["col2"]).values
    y = (df2["col1"] + df2["col2"]).values
    # all pairwise sums, shape (len(df1), len(df2))
    all_values = (x[np.newaxis].T + y[np.newaxis])
    # the first value of each inner loop
    sav_val = all_values[:, 0]
    less_than_one = np.less_equal(all_values, 1)
    less_than_sav_val = np.less(all_values.T, sav_val).T
    passes_condition = less_than_one & less_than_sav_val
    # index of the first True per row; 0 if the row has no True at all
    result = df2['OLDVALUE'].values.take(passes_condition.argmax(axis=1))
    # sentinel for rows where no value passed both conditions
    result[~passes_condition.any(axis=1)] = -999
    return result
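
If you want the output back on df1 as in your original NEWVALUE column, here's a sketch (my assumption of what you intended; note that mixing the "LESSTHAN" string back in forces the whole column to string dtype):

result = new_func(df1, df2)
# replace the -999 sentinel with the original string marker
df1["NEWVALUE"] = np.where(result == -999, "LESSTHAN", result)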

Testing with a df1 of 1000 rows and a df2 of 100 rows.
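
(The test frames here are random, built to mirror the sample data in the question; this construction is my assumption, not part of the original post.)

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(1000, 2), columns=["col1", "col2"])
df1["NEWVALUE"] = np.nan
df2 = pd.DataFrame(np.random.rand(100, 2), columns=["col1", "col2"])
df2["OLDVALUE"] = np.random.randint(0, 1000, size=100)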

all(new_func(df1, df2) == rajat_func(df1, df2))

is True.

%timeit rajat_func(df1, df2)

gives

5.07 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit new_func(df1, df2)

gives

601 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So quite an improvement! Running %time on new_func using a df1 with 1,000,000 rows and df2 with 1000 rows gives

CPU times: user 4.9 s, sys: 3.05 s, total: 7.96 s
Wall time: 7.99 s

Does this solve your problem, or have I completely misunderstood what you are trying to do?

tomjn
  • Nice work, **`+1`** for benchmarking the proposed results against the original code. Next time one ought **always avoid** a test to get principally skewed from cache re-use artifacts (these do not scale to production grade scales in `[SPACE]`-domain and the `[TIME]`-domain scaling will dramatically change, both in value & in nature), so instead of running tests over just 1000-rows, there ought be some reasonably larger chunk, that is sure to get evicted from the recent **`~ 10..70+ [MB]`** L3-cache, as the production sized runs will almost always leave the comfort of an inCache-computing – user3666197 Aug 28 '19 at 16:40
  • Thank you so much for this. It will save a lot of time for me. – Rajat Gupta Aug 29 '19 at 09:17