How to speedup python function using parallel processing?

Question

I have two functions. Each function runs a for loop.

def f1(df1, df2):
    final_items = []
    for ind, row in df1.iterrows():
        id = row['Id']
        some_num = row['some_num']
        timestamp = row['Timestamp']
        res = f2(df=df2, id=id, some_num=some_num, timestamp=timestamp))
        final_items.append(res)

return final_items

def f2(df, id, some_num, timestamp):
    for ind, row in df.iterrows():        
        filename = row['some_filename']
        dfx = reader(key=filename) # User defined; object reader
        # Assign variables
        st_ID = dfx["Id"]
        st_some_num = dfx["some_num"]
        st_time_first = dfx['some_first_time_variable']
        st_time_last = dfx['some_last_time_variable']        

        if device_id == st_ID and some_num == st_some_num:
            if st_time_first <= timestamp and st_time_last >= timestamp:
                return filename
            else:
                return None
        else:
            continue

The first function calls the second function as shown. The first loop occurs 2000 times, i.e., there are 2000 rows in the first dataframe.

The second function (the one that is called from f1()) runs 10 Million times.

My objective is to speed up f2() using parallel processing. I have tried using python packages like Multiprocessing and Ray but I am new to the world of parallel processing and am running into a lot of roadblocks due to lack of experience.

Can some one help me speed up the function so that it takes considerably lesser time to execute for 10 million rows?

At first glance, this looks like something that could be sped up with vectorized pandas or numpy operations before worrying about parallelizing it. Could you provide some sample input and output, as in [this article on good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)? — G. Anderson, Aug 05 '19 at 20:49
Yes, this looks like something that you should try to use `pandas` idioms on before trying to parallelize. At the very least, don't use `.itterows` which is dreadfully slow, and indeed, consider maybe not using pandas at all and see how much of a speed up you would get with the same basic approach. But fundamentally, it looks like you have some O(M*N) algorithm, using indexing, this maybe be reducible considerably. But please provide a [mcve]. Note, `multiprocessing` isn't magic you sprinkle on to a function to make it faster. — juanpa.arrivillaga, Aug 05 '19 at 21:00
In `f2`, you select rows based on criteria. For that you could use the `DataFrame.loc` attribute, as shown in the accepted answer to [this question](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas). — Roland Smith, Aug 05 '19 at 21:15
Python uses a Global Interpreter Lock (GIL) this means even if you have multiple cores and CPUs, only one thread at a time will be interpreted. it will switch to other threads giving appearance of concurrent execution, but there will be no speed up from using threads. the only way to take advantage of multiple cores and CPUs is to use multi*processing*. Python does have modules to support this. — Skaperen, Aug 05 '19 at 22:00

score 0 · Answer 1 · edited Oct 01 '22 at 12:15

_{FACTS : initial formulation asks 2E3 rows in f1() to request f2() to scan 1E7 rows in "shared" df2,
so as to get called an unspecified reader()-process to receive some other data to decide about further processing or return}

My objective is to speed up f2() using parallel processing
Can some one help me speed up the function so that it takes considerably lesser time to execute for 10 million rows?

Surprise No.1 : This is NOT a use-case of parallel-processing

The problem, as-is formulated above, calls many times file-I/O operations, that are never true-[PARALLEL] down there on the physical storage level, are they? Never. Any and all smart file-I/O-(pre)-caching and sliding-window file-I/O tricks cease to help on even moderate levels of a just-[CONCURRENT] workloads and often wreak havoc if going a single step beyond that principal workload ceiling due to physically limited scope of memory resources and I/O-bus width x speed and the weakest chain element's latency increasing under still growing traffic-loads.

The workflow controlling iterators are pure-[SERIAL] "Work Dispatchers" that sequentially step through their domain of values, one after another, and order just another file to get ( again iteratively ) processed.

Surprise No.2 : Vectorisation will NOT help

While vectorised operations are smart for many vector/matrix/tensor processing schemes ( love using numpy + numba ), the Condicio Sine Qua Non is, that the problem has to be:

"compact" - so that it gets easily expressed by vectorising syntax-tricks, which this original [SERIAL]-row-after-row-after-row to find a first and only first "device_ID match" in a "remote"-file-content, next return None if not ( <exprA> and <exprB> ) else filename
"uniform", i.e. non-sequential "until" something first happens - the vectorisation is great to "cover" the whole N-dimensional space with smart-internal code for (best) orthogonal-sub-structures processing uniformly "across" the whole space. On the contrary here, the vectorisation is hard to re-sequentialise "back" to stop (poison) it from any further smart-producing results right after the first occurrence was matched... (ref.1 above "find first and only first occurrence ( and die / return ) )
"memory-adequately-sized", i.e. given any add-on logic is added to the vectorised task, whenever a code asks vectorisation engine to process N-dim "data" using some sort of where(...)-clause, the interim product of such where(...)-condition is consuming additional [SPACE]-footprint ( best in RAM, worse in SWAP-file-I/O ) and this additional memory-footpring may soon devastate any and all benefits from the idea of vectorised processing re-formulation ( not speaking about the cases that due to such immense additional memory-allocation needs result but in a swap-file-I/O suffocation of the whole process flow ) where(...)-clause over a 10E6 rows is expensive, the more once the global strategy is to execute that 1 < nCPUs < 2E3 many times ( as noted above, vectorisation goes uniformly "across" the whole range of data, no sequentially beneficial shortcuts to stop after a first and only the first match... )

THE BEST NEXT STEP : dependency-graph -> latencies -> bottleneck

The problem as-is formulated above is a just-[CONCURRENT] processing, where the actual blocking or availability of "shared" resources' usage limits the overall processing duration. Having no more than a given set of resources to use, there are no magic chances to speed-up the concurrent usage patterns for faster processing. Thus the "amounts" of free-resources to harness and their respective response-"latencies" _{sure, those under-high-levels-of-concurrent-workloads, not the idealistic, unloaded, response times}

If you have no profiling data, measure/benchmark at least the main characteristic durations:

a) the net f2()-per-row process latency [ min, Avg, MAX, StDev] in [us]

b) the reader()-related setup/retrieve latency [ min, Avg, MAX, StDev] in [us]

test, whether the reader()'s performance represents or not a bottleneck - a ceiling for the any-increased-concurrency operated process-flow

If it does, you get it's maximum workload it can handle and based on this, the concurrent-processing may get the speed forwards up to this reader()-determined performance ceiling.

All the rest is elementary.

Epilogue

Such latency-data engineered, (un)avoidable bottleneck-aware right-sized concurrent processing setup for a maximum Latency Masking is about the maximum one can expect here to help.

Given a chance to re-engineer and re-factor the global strategy, there might be much faster processing times, but that may come from other than a pure-[SERIAL] tandem of sequential iterators instructing the sequence of about ~ 20.000.000.000 calls to an unknown reader()-code.

Yet, that goes ways beyond the scope of this Stack Overflow MinCunVE-problem definition.

Hope this might have sparked some fresh views on how to make the results faster. Smart ideas may lead to processing times from a few days down to a few minutes (!). Having gone this way a few times, no one will believe how fulfilling this hard work may get both you and your customer(s), if you hit such a solution by designing the right-sized solution for their business domain.

How to speedup python function using parallel processing?

1 Answers1

Surprise No.1 : This is NOT a use-case of parallel-processing

Surprise No.2 : Vectorisation will NOT help

THE BEST NEXT STEP : dependency-graph -> latencies -> bottleneck

Epilogue