
We have about 10 Python processes running on a Linux box, all reading the same large data-structure (which happens to be a Pandas DataFrame, essentially a 2D numpy matrix).

These processes must respond to queries as quickly as possible, and keeping the data on disk is simply not fast enough for our needs anymore.

What we really need is for all the processes to have full random access to the data-structure in memory, so they can retrieve all elements necessary to perform their arbitrary calculations.

We cannot duplicate the data-structure 10 times (or even twice) in-memory due to its size.

Is there a way all 10 Python processes can share random access to the data-structure in memory?

Dun Peal
  • Are the processes only reading from the DataFrame, or are they both reading and writing (modifying the DataFrame)? And what OS are you using? – unutbu Jan 29 '15 at 18:37
  • @unutbu: They are only reading - the data is written by a different process, and is essentially static and read-only once written. Also, this is Linux (Ubuntu Server 14.04). – Dun Peal Jan 29 '15 at 18:40

2 Answers


Because Linux supports Copy-on-Write (COW) on fork(), data is not copied unless it is written to.

Therefore, if you define the DataFrame `df` in the global namespace, you can access it from as many subsequently forked subprocesses as you wish, and no extra memory for the DataFrame is required.

Only if one of the subprocesses modifies df (or data on the same memory page as df) is the data (on that memory page) copied.

So, as strange as it may sound, you don't have to do anything special on Linux to share access to a large in-memory data structure among subprocesses except define the data in the global namespace before spawning the subprocesses.

Here is some code demonstrating Copy-on-Write behavior.


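For example (a sketch; the DataFrame contents and worker count are illustrative, and it relies on Linux's fork-based `multiprocessing`):

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

# Define the DataFrame in the global namespace *before* forking, so every
# child process inherits the same physical memory pages via copy-on-write.
df = pd.DataFrame(np.arange(1_000_000, dtype=np.float64).reshape(-1, 10))

def read_row_sum(i):
    # Read-only access: no page is written to, so nothing is copied.
    return df.iloc[i].sum()

# Use the fork start method explicitly (it is the default on Linux);
# fork() is what makes the copy-on-write sharing possible.
with mp.get_context("fork").Pool(4) as pool:
    sums = pool.map(read_row_sum, range(10))

print(sums[0])  # 45.0  (row 0 holds the values 0 through 9)
```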
When data gets modified, the memory page on which it resides gets copied. As described in this PDF:

Each process has a page table which maps its virtual addresses to physical addresses; when the fork() operation is performed, the new process has a new page table created in which each entry is marked with a "copy-on-write" flag; this is also done for the caller's address space. When the contents of memory are to be updated, the flag is checked. If it is set, a new page is allocated, the data from the old page copied, the update is made on the new page, and the "copy-on-write" flag is cleared for the new page.

Thus, if there is an update to some value on a memory page, that page is copied. If part of a large DataFrame resides on that memory page, then only that part gets copied, not the whole DataFrame. By default the page size is usually 4 KiB but can be larger depending on how the MMU is configured.

Type

```
% getconf PAGE_SIZE
4096
```

to find the page size (in bytes) on your system.
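The same value is also available from within Python, which is handy when reasoning about how much memory a stray write will actually copy:

```python
import mmap
import os

# Both of these report the same value as `getconf PAGE_SIZE`:
page_size = os.sysconf("SC_PAGE_SIZE")
print(page_size)      # typically 4096 on x86-64 Linux
print(mmap.PAGESIZE)  # same value, exposed by the mmap module
```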

unutbu
  • Thanks! Are there any special stipulations about when the COW would happen? For example, what if the top-level `__main__` process includes a whole bunch of global data. The COW explanation you linked referred to "memory pages". Is it possible that one of the forked processes will end up triggering a full copy of the `df` just because it tried to (over)write any of the much smaller data-structures inherited from `__main__`? – Dun Peal Jan 29 '15 at 20:40
  • 1
    @DunPeal: Only the affected memory page is copied. I've added a bit more detail about this above. – unutbu Jan 29 '15 at 21:05
  • Great! One final question for you: do I have to spawn those workers using `multiprocessing.Process` as in your example, so they will have access to the global scope of the `__main__`, or is there another way to go about providing them such access? For example, is there any way to make it work with the less fancy `subprocess.Popen`? – Dun Peal Jan 29 '15 at 23:28
  • 1
    You'll need to use `multiprocessing` or (in Python3) `concurrent.futures`. [Here is a great tutorial](http://pymotw.com/2/multiprocessing/index.html) to get you started. – unutbu Jan 30 '15 at 00:15

The other solution has the considerable drawbacks of requiring the DataFrame to be static at all times, and of requiring Linux.

You should look at Redis, which, while not quite as fast as a native Python in-memory structure, is still very fast for querying table-like objects on the same machine. If you want ultimate speed via a shared-memory structure, though, look at the nascent Apache Arrow project.

Both offer the considerable benefits of dynamic data and cross-language support, and Redis additionally lets you go multi-node.
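As a sketch of the Redis pattern (it assumes the `redis` client package and a server on localhost, so the connection lines are left commented out; the serialization round-trip works on its own):

```python
import pickle

import numpy as np
import pandas as pd

# import redis  # pip install redis; assumes a server on localhost:6379

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

def to_blob(frame):
    # Serialize the whole DataFrame into a single value Redis can store.
    return pickle.dumps(frame)

def from_blob(blob):
    return pickle.loads(blob)

# With a live server, one writer process publishes the frame and any
# number of reader processes (even on other machines) fetch it:
#
#   r = redis.Redis()
#   r.set("df", to_blob(df))      # writer
#   df2 = from_blob(r.get("df"))  # readers

# The round-trip itself requires no server:
restored = from_blob(to_blob(df))
print(restored.equals(df))  # True
```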

Thomas Browne