Because Linux supports Copy-on-Write (COW) on fork(), data is not copied unless it is written to.
Therefore, if you define the DataFrame, df, in the global namespace, then you can access it from as many subsequently spawned subprocesses as you wish, and no extra memory for the DataFrame is required.
Only if one of the subprocesses modifies df (or data on the same memory page as df) is the data (on that memory page) copied.
So, as strange as it may sound, you don't have to do anything special on Linux to share access to a large in-memory data structure among subprocesses except define the data in the global namespace before spawning the subprocesses.
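As a minimal sketch of that pattern (not the code linked below), the following defines a module-level object before starting any workers and then only reads it from the children. The names data, partial_sum and parallel_sum are hypothetical, and an array.array stands in for the DataFrame so the example needs only the standard library; its values sit in one contiguous buffer that plain reads do not write to. The fork start method is requested explicitly because the default start method varies across platforms and Python versions.

```python
import array
import multiprocessing as mp

# Hypothetical stand-in for the large DataFrame `df`. It is defined at module
# level BEFORE any subprocess is started, so forked children inherit it.
data = array.array("q", range(1_000_000))


def partial_sum(start, stop, out):
    # Read-only access to the inherited global; `data` itself is never
    # pickled or sent to the child, only the small result travels back.
    out.put(sum(data[i] for i in range(start, stop)))


def parallel_sum(n_workers=4):
    ctx = mp.get_context("fork")  # fork is what makes COW sharing possible
    out = ctx.SimpleQueue()
    # Split [0, len(data)) into n_workers contiguous, non-overlapping ranges.
    edges = [len(data) * i // n_workers for i in range(n_workers + 1)]
    procs = [
        ctx.Process(target=partial_sum, args=(edges[i], edges[i + 1], out))
        for i in range(n_workers)
    ]
    for p in procs:
        p.start()
    total = sum(out.get() for _ in procs)  # collect results before joining
    for p in procs:
        p.join()
    return total


if __name__ == "__main__":
    print(parallel_sum() == sum(data))
```

Only the index bounds and the per-worker totals cross the process boundary here; the large object is reached through the child's inherited address space.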
Here is some code demonstrating Copy-on-Write behavior.
When data gets modified, the memory page on which it resides gets copied.
As described in this PDF:
Each process has a page table which maps its virtual addresses to
physical addresses; when the fork() operation is performed, the new
process has a new page table created in which each entry is marked
with a "copy-on-write" flag; this is also done for the caller's
address space. When the contents of memory are to be updated, the
flag is checked. If it is set, a new page is allocated, the data from
the old page copied, the update is made on the new page, and the
"copy-on-write" flag is cleared for the new page.
Thus, if there is an update to some value on a memory page, that page is copied.
If part of a large DataFrame resides on that memory page, then only that part gets copied, not the whole DataFrame.
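The observable effect of this can be sketched with a bare os.fork(). The example below is an illustration under assumptions, not a page-level measurement: buf and child_write_is_private are hypothetical names, and the test only shows that a write made in the child lands on the child's private copy while the parent's data stays untouched. It does not count how many pages the kernel actually copied.

```python
import os

# Hypothetical stand-in for part of a large in-memory structure
# (8192 bytes, i.e. two pages if the page size is 4096).
buf = bytearray(b"a" * 8192)


def child_write_is_private():
    """Fork, modify `buf` in the child, return (child_byte, parent_byte)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: the write below goes to the child's own copy of
        # the affected page; the parent's memory is not changed.
        try:
            os.close(r)
            buf[0] = ord("b")
            os.write(w, bytes(buf[:1]))
        finally:
            os._exit(0)  # leave the child without running parent cleanup
    # Parent process: read what the child saw, then look at our own buffer.
    os.close(w)
    child_byte = os.read(r, 1)
    os.close(r)
    os.waitpid(pid, 0)
    return child_byte, bytes(buf[:1])


if __name__ == "__main__":
    print(child_write_is_private())
```

The child reports the modified byte, while the parent still sees the original one, which is consistent with the write having been applied to a copied page rather than to shared memory.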
By default the page size is usually 4 KiB but can be larger depending on how the MMU is configured.
Type
% getconf PAGE_SIZE
4096
to find the page size (in bytes) on your system.
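The same value can be read from Python, which is handy for estimating how many pages a buffer of a given size occupies. This is a small sketch; page_size and pages_spanned are hypothetical helper names, and the estimate assumes the buffer starts on a page boundary.

```python
import mmap
import os


def page_size():
    # Same value that `getconf PAGE_SIZE` reports; mmap.PAGESIZE is an
    # alternative way to obtain it.
    return os.sysconf("SC_PAGE_SIZE")


def pages_spanned(nbytes, page=None):
    # Ceiling division: number of pages needed to hold `nbytes` bytes,
    # assuming the data begins at the start of a page.
    page = page or page_size()
    return -(-nbytes // page)


if __name__ == "__main__":
    print(page_size(), mmap.PAGESIZE)
    print(pages_spanned(10**9))  # pages behind a 1 GB buffer on this system
```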