
Is there a way to share a huge dictionary with multiprocessing subprocesses on Windows without duplicating the whole memory? I only need read-only access to it within the subprocesses, if that helps.

My program roughly looks like this:

def workerFunc(args):
    idx, data_mp, some_more_args = args

    # Do some logic:
    # parse some files on disk
    # and access some random keys from data_mp which are only known after parsing those files ...
    some_keys = [some_random_ids...]

    # Do something with those keys
    do_something = [data_mp[x] for x in some_keys]
    return do_something


if __name__ == "__main__":
    multiprocessing.freeze_support()    # needed because this script will later be frozen into an .exe with PyInstaller ...

    DATA = readpickle('my_pickle.pkl')   # my_pickle.pkl is huge, ~1GB
    # DATA looks like this:
    # {1: ['some text', SOME_1D_OR_2D_LIST...[[1,2,3], [123...]]], 
    #  2: ..., 
    #  3: ..., ..., 
    #  1 million keys... }

    # Here I'm doing something with DATA in the main program...

    # Then I want to spawn N multiprocessing subprocesses, each doing some logic and then reading a few keys of DATA ...

    manager = multiprocessing.Manager()
    data_mp = manager.dict(DATA)    # Right now I'm putting DATA into the shared memory... so it effectively duplicates the required memory...

    joblist = []
    for idx in range(10000): # Generate the jobs; the shared data_mp proxy is passed to each worker later on ...
        joblist.append((idx, data_mp, some_more_args))

    # Start Pool of Procs... 
    p = multiprocessing.Pool()
    returnNodes = []
    for ret in p.imap_unordered(workerFunc, joblist):
        returnNodes.append(ret)

    # Do some post-processing with DATA and returnNodes...
    # and generate an overview xls file from it

Unfortunately there's no other way for me to store my big dictionary... I know an SQL database would be better, because each worker only accesses a few keys of data_mp within its subprocess, but I don't know in advance which keys each worker will address.

So my question is: is there any other way on Windows to do this instead of using a Manager.dict(), which, as stated above, effectively duplicates the required memory?
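
For comparison, here is a minimal sketch of the obvious alternative: let every worker load the pickle once via a Pool initializer into a module-level global. It avoids the Manager proxy round-trips, but since Windows spawns fresh processes, each worker still ends up holding its own full ~1GB copy, so it doesn't really solve the memory problem (function and variable names below are just illustrative):

import multiprocessing
import pickle

DATA = None   # module-level global; filled once per worker process

def initWorker(pickle_path):
    # Runs once in every spawned worker: each process re-reads the ~1GB pickle,
    # so the memory is still duplicated N times, but key lookups are then local.
    global DATA
    with open(pickle_path, 'rb') as f:
        DATA = pickle.load(f)

def workerFunc(args):
    idx, some_more_args = args
    some_keys = []                       # known only after parsing files on disk
    return [DATA[x] for x in some_keys]  # plain dict lookups, no proxy calls

if __name__ == "__main__":
    multiprocessing.freeze_support()
    p = multiprocessing.Pool(initializer=initWorker, initargs=('my_pickle.pkl',))
    joblist = [(idx, None) for idx in range(10000)]   # some_more_args left out here
    returnNodes = list(p.imap_unordered(workerFunc, joblist))

So this variant just trades the slow proxy access for N full copies of DATA in RAM, which is exactly what I'm trying to avoid.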

Thanks!

EDIT Unfortunately, in my corporate environment there's no way for my tool to use a SQL DB because there's no dedicated machine available. I can only work file-based on network drives. I already tried SQLite, but it was seriously slow (even though I didn't understand why...). Yes, DATA is a simple key->value kind of dictionary...

And I'm using Python 2.7!
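
If a file-based key->value store is a reasonable direction, here is a rough sketch of what I have in mind using Python 2.7's shelve module (it pickles the values into a dbm file; shelve keys must be strings, so I'd have to wrap my integer ids in str()). File names and helper names below are just placeholders, and I haven't benchmarked this on the network drive:

import shelve

def build_shelf(data, path='data.shelf'):
    # One-off conversion of the in-memory dict into a dbm-backed shelve file.
    db = shelve.open(path, flag='n')     # 'n' = always create a new, empty file
    for key, value in data.iteritems():
        db[str(key)] = value             # shelve keys must be strings in Python 2
    db.close()

def workerFunc(args):
    idx, shelf_path, some_more_args = args
    db = shelve.open(shelf_path, flag='r')   # read-only handle, opened in each worker
    some_keys = []                           # known only after parsing files on disk
    result = [db[str(x)] for x in some_keys]
    db.close()
    return result

Whether opening and reading the shelve file from a network drive is fast enough is exactly the part I'm unsure about, given my SQLite experience.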

  • you really do want some sort of database like sqlite or something, but alternatives include [dbm](https://stackoverflow.com/a/11837998/7540911) and [mmap](https://docs.python.org/3.8/library/mmap.html). Wait, if your data is indexed by number anyway, what's the problem with dumping it into SQL? – Nullman Feb 27 '20 at 14:39
  • Unfortunately, in my corporate environment there's no way for my tool to use a SQL DB because there's no dedicated machine available. I can only work file-based on network drives. I tried SQLite already, but it was seriously slow (even though I didn't understand why...). dbm and mmap are new to me; if those are "reasonable" alternatives to a file-based DB where very fast access to single dictionary keys is possible, that would be very good. – tim Feb 27 '20 at 15:05
  • but [sqlite](https://docs.python.org/2.7/library/sqlite3.html) is a FILE, it's not a server-based solution – Nullman Feb 27 '20 at 15:08
  • Yes, I know, that's why I tried it, but it was seriously slow when accessed from our network drives. – tim Feb 27 '20 at 15:10
  • If you want linear speedup you should partition the data so each worker has the data it needs, not a full copy. If the workers need access to common data you should look to multithreading solutions instead. Cross-process communication is expensive no matter what. – Panagiotis Kanavos Feb 27 '20 at 15:16
  • Unfortunately I don't know in advance which keys each worker needs to access; that is only known within the worker after parsing some individual files again. Thanks, maybe I'll check the suggested "dbm", so that no cross-process communication of the whole dictionary `DATA` is required anymore, just a link to a shared sort of "database" connection, and each worker can hopefully access the required keys pretty fast. – tim Feb 27 '20 at 15:22
  • Here's a simple library that I put together to serve data to multiple python processes [PythonDataServe](https://github.com/bjascob/PythonDataServe). You have to run it as a separate python executable to serve the data but sounds like it might work for you, although I haven't tested it under python 2. – bivouac0 Feb 27 '20 at 21:32

0 Answers