I have a read-only, very large dataframe and I want to do some calculations on it, so I do a multiprocessing.map and set the dataframe as a global. However, does this imply that each process will get its own separate copy of the dataframe (and so will it be faster than a shared one)?
Note that it is completely unsafe to use multiprocessing if you have ever created any threads. If any random lock was locked by some other thread (and there are a *lot* of locks), the forked processes will deadlock when they try to access it. – o11c Aug 30 '17 at 03:12
2 Answers
If I understand it correctly, you won't get any benefit from trying to use multiprocessing.map on a Pandas DataFrame, because the DataFrame is built on NumPy ndarray structures and NumPy already releases the GIL, scales to SMP hardware, uses vectorized machine instructions where they're available, and so on.
As you say, you might be incurring massive RAM consumption and data-copying or shared-memory locking overhead on the DataFrame structures for no benefit. Performance considerations for the combination of NumPy and Python's multiprocessing module are discussed in this SO question: Multiprocessing.Pool makes Numpy matrix multiplication slower.
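To illustrate that point: because NumPy releases the GIL during heavy vectorized work, a plain thread pool can often give you parallelism on a DataFrame with no copying at all. A minimal sketch, assuming your per-column work is dominated by NumPy code (the column_stats function here is purely illustrative):

    import numpy as np
    import pandas as pd
    from concurrent.futures import ThreadPoolExecutor

    # A stand-in for the large, read-only DataFrame.
    df = pd.DataFrame(np.random.rand(1_000_000, 8), columns=list("abcdefgh"))

    def column_stats(col):
        # NumPy reductions run in C and release the GIL, so threads can overlap.
        values = df[col].values
        return col, values.mean(), values.std()

    with ThreadPoolExecutor(max_workers=4) as pool:
        stats = {name: (mean, std)
                 for name, mean, std in pool.map(column_stats, df.columns)}

Threads share the DataFrame by definition, so there is no fork, no pickling and no extra RAM.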
The fact that you're treating this DataFrame as read-only is interesting because it suggests that you could write code around os.fork() which, thanks to the OS's CoW (copy-on-write) semantics for the fork() system call, should be an inexpensive way to share the data with child processes, allowing each to then analyze the data in various ways. (Any code which writes to the data would trigger allocation of fresh pages and copying, of course.)
The multiprocessing module uses the fork() system call under the hood (at least on Unix, Linux and similar systems). If you create and fully populate this large data structure (DataFrame) before you invoke any of the multiprocessing functions or instantiate any of its objects which create subprocesses, then you might be able to access the copies of the DataFrame which each process implicitly inherited. I haven't tested this rigorously, but a sketch of the idea follows.
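A minimal, untested sketch, assuming a Unix-like platform where multiprocessing forks its workers; the sum_of_column worker is just a placeholder for your real calculation:

    import multiprocessing as mp
    import numpy as np
    import pandas as pd

    # Build the large, read-only DataFrame *before* any worker processes exist,
    # so forked children inherit it via copy-on-write instead of pickling it.
    df = pd.DataFrame(np.random.rand(10_000_000, 4), columns=list("abcd"))

    def sum_of_column(col):
        # Read-only access: the underlying NumPy buffers stay as shared pages.
        return col, df[col].sum()

    if __name__ == "__main__":
        # "fork" is the default start method on Linux; "spawn" or "forkserver"
        # would re-import the module and rebuild df in every child instead.
        with mp.get_context("fork").Pool(processes=4) as pool:
            results = dict(pool.map(sum_of_column, df.columns))
        print(results)

One caveat: CPython's reference counting writes into object headers, so some pages do get copied even on "read-only" access, but the bulk of a DataFrame's memory sits in NumPy buffers, which remain shared.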
As for consolidating your results back to some parent or delegate process ... you could do that through any IPC (inter-process communications) mechanism. If you were able to share the data implicitly by initializing it before calling any multiprocessing forking methods, then you might be able to simply instantiate a multiprocessing.Queue and feed results through it. Failing that, personally I'd consider just setting up an instance of Redis, either on the same system or on any other system on that LAN segment. Redis is very efficient and extremely easy to configure and maintain, with APIs and Python modules (including automatic/transparent support for hiredis for high-performance deserialization of Redis results).
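Going back to the Queue option, a rough sketch of that pattern (again untested; the per-column worker and the one-column-per-process split are only for illustration):

    import multiprocessing as mp
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 4), columns=list("abcd"))

    def worker(columns, out_queue):
        # Each forked child reads the inherited df and ships results back.
        for col in columns:
            out_queue.put((col, float(df[col].mean())))

    if __name__ == "__main__":
        ctx = mp.get_context("fork")
        queue = ctx.Queue()
        procs = [ctx.Process(target=worker, args=([col], queue)) for col in df.columns]
        for p in procs:
            p.start()
        # Drain the queue before joining so children aren't blocked on a full pipe.
        results = dict(queue.get() for _ in df.columns)
        for p in procs:
            p.join()
        print(results)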
Redis might also make it easier to distribute your application across multiple nodes if your needs take you in that direction. Of course, by then you might also be looking at PySpark, which offers many features that map fairly well from Pandas DataFrames to Apache Spark RDDs (or Spark SQL "DataFrames"). Here's an article on that from a couple of years ago: Databricks: From Pandas to Apache Spark's DataFrames.
In general, the whole point of Apache Spark is to distribute data computations across separate nodes; this is inherently more scalable than distributing them across cores within a single machine. (Then the concerns boil down to I/O to the nodes so each can get its chunks of the data set loaded; that's a problem which is well suited to HDFS.)
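If you do head that way, the hand-off from Pandas is a one-liner; a small sketch, assuming pyspark is installed and a local or cluster Spark session is available (the key/value columns are just stand-ins):

    import numpy as np
    import pandas as pd
    from pyspark.sql import SparkSession

    pandas_df = pd.DataFrame({"key": np.random.choice(list("xyz"), 100_000),
                              "value": np.random.rand(100_000)})

    spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

    # The conversion distributes the data as a Spark DataFrame; subsequent
    # transformations run on the executors rather than in the driver process.
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.groupBy("key").avg("value").show()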
I hope that helps.

Every sub-process will have its own resources, so this does imply copying. More precisely, each sub-process will copy the part of the original dataframe it works on, determined by your implementation.
But will it be faster than a shared one? I'm not sure. Unless your dataframe implements a read/write lock, reading one shared copy or reading separate copies amounts to the same thing. But why would a dataframe need to lock read operations? It doesn't make sense.
