
I need to do some computation on different slices of some big dataframes.

Suppose I have three big dataframes df1, df2 and df3, each of which has a "Date" column. I need to run some computation on these dataframes based on date slices, and since each iteration is independent of the others, I want to run these iterations concurrently.

df1              # a big dataframe
df2              # a big dataframe
df3              # a big dataframe


So I define my desired function: in each child process, a slice of df1, df2 and df3 is created first, and then the other computations run on those slices.

Since df1, df2 and df3 are global dataframes, I need to pass them as default arguments to my function; otherwise they won't be recognized inside the child processes.

Something like below:

slices = ['2020-04-11', '2020-04-12', '2020-04-13']
# a list of dates used to slice the dataframes

def my_func(slice,df1=df1,df2=df2,df3=df3):
    sliced_df1 = df1[df1.Date >  slice]
    sliced_df2 = df2[df2.Date <  slice]
    sliced_df3 = df3[df3.Date >= slice]
    #
    # other computations
    # ...
    #
    return desired_df

The concurrent processing is configured as below:

import multiprocess
import pandas as pd
import psutil

# one worker per physical core
pool = multiprocess.Pool(psutil.cpu_count(logical=False))

final_df = pool.map(my_func, [slice for slice in slices])
pool.close()
pool.join()
final_df = pd.concat(final_df, ignore_index=True)

However, it seems that only one CPU core is used during execution.

I suppose that since each child process needs to access the global dataframes df1, df2 and df3, there should be some shared memory for the child processes. From what I found searching the net, I guess I have to use multiprocessing.Manager(), but I am not sure how to use it, or whether I am right about using it at all.

I am actually new to the concept of concurrent processing, and I would appreciate it if someone could help.

PS: It seems that my question is similar to this post; however, that one doesn't have an accepted answer.

  • Do you use ``multiprocess`` or ``multiprocessing``? Note that there is no such thing as global data for a multiprocessing application - each process has its own globals *copied* between the processes on startup. You can explicitly share them (e.g. via a Manager) but this leads to synchronisation, which removes the advantage of multiprocessing. Is there a conceptual reason why all the data is owned by the main process, instead of each process reading/fetching just its own data concurrently? – MisterMiyagi Apr 22 '20 at 11:55
  • @MisterMiyagi since I am new to the concept of parallel processing, I am not sure if I understand the difference between multiprocessing and multiprocess. And the reason that all data is owned by main process, is that I have a huge text files and I read them once into the main process and want to slice them and then compute on them in parallel. I couldn't manage a way to read slices from the text files based on condition on "Date" column, so that I don't need to read them all at once. – mpy Apr 22 '20 at 12:01
  • The standard library module is called ``multiprocessing``. ``multiprocess`` is a third-party module. – MisterMiyagi Apr 22 '20 at 12:03
  • @MisterMiyagi OK Thanks... Is there a solution to read each slice from the text files and not read them all in the main process? – mpy Apr 22 '20 at 12:04
  • @MisterMiyagi You may already know that **the full copy you describe** happens *only in some particular cases* (it is the default and the only mode on Windows, and used on Linux / MacOS only on explicit request) – user3666197 Apr 22 '20 at 12:44
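
As MisterMiyagi suggests above, one alternative worth weighing is to let each worker read and filter only its own slice straight from the text files, so the main process never has to own (or ship) the full dataframes. A minimal sketch, assuming a hypothetical file big_file_1.csv with a parseable "Date" column (placeholders, not the actual data layout):

import multiprocessing as mp

import pandas as pd


def work_on_slice(slice_date):
    # read the big text file in manageable chunks and keep only the rows
    # this worker actually needs -- the full frame is never materialised here
    chunks = pd.read_csv("big_file_1.csv", parse_dates=["Date"], chunksize=100_000)
    sliced_df1 = pd.concat(chunk[chunk.Date > slice_date] for chunk in chunks)
    # ... same idea for the other files, then the per-slice computation ...
    return sliced_df1


if __name__ == "__main__":
    slices = pd.to_datetime(["2020-04-11", "2020-04-12", "2020-04-13"])
    with mp.Pool(processes=3) as pool:
        results = pool.map(work_on_slice, slices)
    final_df = pd.concat(results, ignore_index=True)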

1 Answer


Q : "... am ( I ) right about using it?"


A) multiprocess != multiprocessing

Well, you might have already noticed that the standard-library multiprocessing module is not the same thing as Mike McKerns' multiprocess module. (The promises and properties of the latter remain rather hidden these days: the module documentation is published on ReadTheDocs - yes, the RTFM-flavoured name was their idea, not mine - but as of 2020-04 the page at https://multiprocess.readthedocs.io/en/latest/ fails to render any multiprocess-relevant content at all. It would be great if Mike McKerns stepped in and repaired this deficit, as otherwise pathos, dill and the other pieces of his published work are known for being fabulously great software, aren't they?)


B) You pay way more than you will ever receive back

If you are trying to avoid duplicated DataFrame instances because of a RAM ceiling, you cannot use process-based multiprocessing at all; the remaining thread-based route drops you straight into the trap of a pure-[SERIAL], central-GIL-lock-dictated, one-step-after-another-step mode of execution: no increase in performance, but the very contrary, as you pay the add-on costs of thread multiplexing while every thread waits both for its turn at holding the GIL-lock and for free RAM-I/O to fetch / chew / store (process) some data.
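
To make that first trap visible, here is a minimal sketch (illustrative data and names, not the asker's code) using a thread-based pool from the standard-library multiprocessing.dummy: the dataframes are shared for free, but the CPU-bound filtering still runs roughly one thread at a time under the GIL, so wall-clock time stays close to a plain serial loop:

from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

import numpy as np
import pandas as pd

# illustrative stand-in for one of the big dataframes
df1 = pd.DataFrame({"Date": pd.date_range("2020-04-01", periods=1_000_000, freq="s"),
                    "x": np.random.rand(1_000_000)})

def my_func(slice_date):
    sliced = df1[df1.Date > slice_date]   # no copy of df1 is shipped anywhere
    return pd.DataFrame({"Date": [slice_date], "mean_x": [sliced.x.mean()]})

slices = pd.to_datetime(["2020-04-11", "2020-04-12", "2020-04-13"])

with ThreadPool(3) as pool:
    final_df = pd.concat(pool.map(my_func, slices), ignore_index=True)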

If you instead try to go for linux-fork-ed processes (bearing the risks of doing that - see the docs / issues for details) and use a Manager()-instance (hey, yes, again the GIL-lock will leave but one CPU-core doing the work - we have heard about this already, haven't we?) or other similar syntax-only tricks to access, or to orchestrate access to, globals or "shared" objects from the main process, then again the costs of doing so run so wildly against your wished-to-have performance that (except for network-I/O-latency-masking use-cases) there will never be a measurable advantage - quite often the very contrary: you will pay way more in add-on overhead costs than you will ever receive back in wished-to-have increased performance.
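
For illustration, a minimal sketch of the Manager() route the question asks about (cut-down, made-up data, not the asker's files). Note that every attribute access on the proxy pickles the whole dataframe in the Manager process and unpickles it in the worker, which is exactly the overhead described above:

import multiprocessing as mp

import numpy as np
import pandas as pd


def my_func(args):
    ns, slice_date = args
    # every access to ns.df1 pickles the WHOLE dataframe in the Manager
    # process and unpickles it here -- the "sharing" is sharing in name only
    df1 = ns.df1
    sliced_df1 = df1[df1.Date > slice_date]
    return pd.DataFrame({"Date": [slice_date], "rows": [len(sliced_df1)]})


if __name__ == "__main__":
    df1 = pd.DataFrame({"Date": pd.date_range("2020-04-01", periods=100_000, freq="min"),
                        "x": np.random.rand(100_000)})
    with mp.Manager() as manager:
        ns = manager.Namespace()
        ns.df1 = df1                      # one full pickle into the Manager process
        slices = pd.to_datetime(["2020-04-11", "2020-04-12", "2020-04-13"])
        with mp.Pool(processes=3) as pool:
            final_df = pd.concat(pool.map(my_func, [(ns, s) for s in slices]),
                                 ignore_index=True)
    print(final_df)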

If you pass all the DataFrame instances via parameter transfer (as your syntax tries to do), your processes (be they forked or spawned) will have to pay even more add-on overhead costs, and every one of them will hold a full-scale replica of the originals (df1, df2, df3) - yes, RAM will prove it. On top of that, the multiprocessing-related add-on overheads now include immense, devastating SER/DES costs, because these BIG-FAT pieces of meat have to go through SER / DES - most often a SER-phase in pickle.dumps(), an XFER-phase over a p2p data transfer (data-flow), and a DES-phase in pickle.loads() - and all of this is executed 3 * len([slice for slice in slices]) times, under a RAM-contention ceiling and with CPU-contention throttling by the central GIL-lock on the __main__-side, which has to finish all the (SER, XFER) phases on its side (RAM permitting) before the "remote" DES-phases can get ahead, most probably with heavy swap thrashing already under way.
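
A quick way to put a number on that SER/DES bill for your own data: time a single pickle round-trip of one dataframe (illustrative stand-in below) and multiply by the number of dataframes times the number of slices:

import pickle
import time

import numpy as np
import pandas as pd

# illustrative stand-in for one of the big dataframes
df1 = pd.DataFrame({"Date": pd.date_range("2020-01-01", periods=5_000_000, freq="s"),
                    "x": np.random.rand(5_000_000)})

t0 = time.perf_counter()
blob = pickle.dumps(df1, protocol=pickle.HIGHEST_PROTOCOL)   # SER-phase
df1_copy = pickle.loads(blob)                                # DES-phase
t1 = time.perf_counter()

print(f"payload: {len(blob) / 1e6:.1f} MB, one SER+DES round-trip: {t1 - t0:.2f} s")
# the XFER-phase (piping this blob to each worker) comes on top of this,
# repeated for every dataframe and every slice in the map() call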

So any of these routes looks more like an anti-pattern for high-performance computing, doesn't it?

Even if it were not for all these Python/GIL-related issues, the science of performance still works against you - see the speedup traps in the revised Amdahl's Law for the details and quantitative methods.
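
A tiny illustration of that speedup trap, using a simplified, overhead-aware form of Amdahl's Law (the parallel fraction p and the overhead fraction o below are made-up numbers; only the shape of the result matters):

# simplified, overhead-aware Amdahl's Law:
#   S(N) = 1 / ( (1 - p) + o + p / N )
# where p is the parallelisable fraction of the serial run time and
# o is the add-on overhead (spawn + SER/XFER/DES) expressed as a
# fraction of that same serial run time -- illustrative numbers only
def speedup(n_workers, p=0.95, o=0.40):
    return 1.0 / ((1.0 - p) + o + p / n_workers)

for n in (1, 2, 4, 8, 16):
    print(f"N = {n:2d}  ->  speedup ~ {speedup(n):4.2f}x")
# with a 40% overhead bill the "speedup" never even reaches 2x here,
# and at small N it drops below 1x -- slower than the plain serial run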


In doubt?

Benchmark, benchmark, benchmark.

Facts matter.

user3666197