25

I have a python program where I need to load and de-serialize a 1GB pickle file. It takes a good 20 seconds and I would like to have a mechanism whereby the content of the pickle is readily available for use. I've looked at shared_memory but all the examples of its use seem to involve numpy and my project doesn't use numpy. What is the easiest and cleanest way to achieve this using shared_memory or otherwise?

This is how I'm loading the data now (on every run):

def load_pickle(pickle_name):
    return pickle.load(open(DATA_ROOT + pickle_name, 'rb'))

I would like to be able to edit the simulation code in between runs without having to reload the pickle. I've been messing around with importlib.reload but it really doesn't seem to work well for a large Python program with many files:

def main():
    data_manager.load_data()
    run_simulation()
    while True:
        try:
            importlib.reload(simulation)
            run_simulation()
        except:
            print(traceback.format_exc())
        print('Press enter to re-run main.py, CTRL-C to exit')
        sys.stdin.readline()
Will Da Silva
  • 6,386
  • 2
  • 27
  • 52
etayluz
  • 15,920
  • 23
  • 106
  • 151
  • What is the data? Do you need to load all of it in one hit? – mkst Jun 08 '21 at 14:31
  • Yes - it's financial data - and the simulation has to process all of it – etayluz Jun 08 '21 at 14:34
  • If your data was stored as dataframes I'd suggest something like `vaex`. Can you edit your question to show an example of the data? – mkst Jun 08 '21 at 14:40
  • 1
    It appears that `shared_memory` stores information as a buffer of bytes. If you aren’t trying to share an array, then you would likely have to reserialize the data again for saving there. – thshea Jun 08 '21 at 14:43
  • If disk read time is a large part of the load time, another (non-Python) option is to load the pickled data into a ramdisk using whatever utility is relevant for your system – thshea Jun 08 '21 at 14:49
  • @thshea can you show an example code of how to use ramdisk? – etayluz Jun 08 '21 at 14:56
  • Pickle doesn't seem like a good choice of serialization format if your data is that large. A format that knows what objects are allowed to point to others will do much less work. Regardless, you shouldn't be afraid of adding `import numpy`; it is useful for all sorts of programs, even if you only use a tiny part of it. – o11c Jun 11 '21 at 22:46
  • 4
    I don't understand what problem you are trying to solve. If the data needs to be "readily available", then why is it getting pickled in the first place - as opposed to just keeping hold of the objects? Why is the program being restarted, especially if there is a need to avoid loading times? – Karl Knechtel Jun 11 '21 at 22:47
  • @KarlKnechtel the pickle contains processed financial data that I do not wish to re-process on every run of the simulation - because that would make each run terribly long. That is why the pickle is needed. The program is being restarted each time because it is in development – etayluz Jun 11 '21 at 23:09
  • @etayluz I am not sure how applicable it is to your use case, but is it an option to use something like Jupyter notebook? I've used it on datasets in the past (about 400 MB) for the same reason. But admittedly, it doesn't work for all use cases. – bobveringa Jun 11 '21 at 23:47
  • 1
    Is there anything stopping you from having a master program and reformatting the simulations as a class to be imported? Then have the main program run all the time (and start on boot) with the data loaded, and any time you want to simulate, *reimport the new simulation class (if possible), copy the data, and pass it in. – thshea Jun 12 '21 at 00:16
  • You can use the [reload function](https://stackoverflow.com/a/1254379/11789440) to accomplish that behavior – thshea Jun 12 '21 at 00:18
  • @thshea - importlib.reload doesn't work for large programs. It works well for a few small files – etayluz Jun 12 '21 at 03:51
  • @etayluz Help me understand your question: so is it like _you have a process that pre-processes the data and pickles it and dumps it to a file. Now, you have another process that is supposed to read and unpickle this - but faster than 20 seconds?_ And is this why you are looking at **shared memory** so that the processes can share the data directly? – anurag Jun 12 '21 at 21:02
  • I know this works around load times rather than shared memory, but could you give cpickle a try instead of pickle? (just replace all references). This will break with some data, but otherwise it will work fine. You could look into using a different pickling protocol to reduce disk reads, or a different form of serialization such as JSON (especially with cJSON in python), or if you can use numpy (which can easily replace nested lists), the array read and write functions. – thshea Jun 12 '21 at 22:46
  • 2
    You say your code doesn't use `numpy`, but what *does* it use? What is this massive data structure you need to save between runs? You're not going to be able to save entire Python objects into some kind of shared memory space, you'd horribly break the interpreter's memory management if you tried. But depending on what your data actually is, you might be able to share something, we just can't know what it will be without knowing something about the data. – Blckknght Jun 13 '21 at 01:25
  • Do you really need to work on a full-sized, "real" data set *while the program is still in development*? Why? – Karl Knechtel Jun 14 '21 at 00:43
  • @anurag - you are exactly right – etayluz Jun 14 '21 at 19:36
  • @Blckknght- my program is a stock trade simulator. It draws on financial data going back to 1995 to the current day - daily open/close price of each stock for every trading day going back 26 years. All this data is necessary on each run of the program. – etayluz Jun 14 '21 at 19:40
  • Maybe you need a proper database like SQL or similar? – Chris_Rands Jun 15 '21 at 21:29
  • What's the structure of your data? Is it using any specific Python objects, or is it just lists, maps, numbers and strings? – pbacterio Jun 16 '21 at 13:39
  • it's just a list of dictionaries – etayluz Jun 16 '21 at 18:22
  • as you basically want to be able to load data from memory, I would recommend to store it in redis. I would first try to dump the list of dicts into one json, store it in redis (in memory db - another option could be memcache) and then load it from there instead of from a pickled object - if that is still not fast enough, store each list item as single object in redis and load all of them in parallel see also: https://stackoverflow.com/questions/32276493/how-to-store-and-retrieve-a-dictionary-with-redis – Pablo Henkowski Jun 18 '21 at 19:02

9 Answers

7

This could be an XY problem, the source of which is the assumption that you must use pickles at all. They're awkward to deal with due to how they manage dependencies, and for that reason they're fundamentally a poor choice for any long-term data storage.

The source financial data is almost certainly in some tabular form to begin with, so it may be possible to request it in a friendlier format.

A simple middleware that deserializes the pickles and reserializes them into the new format will smooth the transition in the meantime:

input -> load pickle -> write -> output

Converting your workflow to use Parquet or Feather, which are designed to be efficient to read and write, will almost certainly make a considerable difference to your load speed.
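
For example, a one-off conversion step might look like the sketch below (assuming the pickled data is a list of dicts, as mentioned in the comments, and that pandas plus pyarrow or fastparquet are installed; convert_pickle_to_parquet and load_table are just illustrative names):

import pickle
import pandas as pd

def convert_pickle_to_parquet(pickle_path, parquet_path):
    # one-off: unpickle the existing file and rewrite it as Parquet
    with open(pickle_path, 'rb') as f:
        records = pickle.load(f)  # assumed to be a list of dicts
    pd.DataFrame.from_records(records).to_parquet(parquet_path)

def load_table(parquet_path):
    # later runs read the Parquet file instead, which is typically much faster
    return pd.read_parquet(parquet_path)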



You may also be able to achieve this with hickle, which internally uses the HDF5 format, ideally making it significantly faster than pickle while still behaving like one.

ti7
  • 16,375
  • 6
  • 40
  • 68
  • I don't know why but hickle is NOT a drop in replacement for pickle - I had to rewrite the code - and then it was super duper slow – etayluz Jun 14 '21 at 20:35
  • definitely not a drop-in, but such a solution can assuage politics because it's easily comparable – ti7 Jun 15 '21 at 18:09
6

An alternative to storing the unpickled data in memory would be to store the pickle in a ramdisk, so long as most of the time overhead comes from disk reads. Example code (to run in a terminal) is below.

sudo mkdir /mnt/pickle
sudo mount -o size=1536M -t tmpfs none /mnt/pickle
cp path/to/pickle.pkl /mnt/pickle/pickle.pkl

Then you can access the pickle at /mnt/pickle/pickle.pkl. Note that you can change the file names and extensions to whatever you want. If disk read is not the biggest bottleneck, you might not see a speed increase. If you run out of memory, you can try turning down the size of the ramdisk (I set it to 1536 MB, or 1.5 GB).

thshea
  • 1,048
  • 6
  • 18
  • Note that this is only for Linux (specifically Ubuntu; I’m not sure how well it generalizes). If you are on Windows or Mac, you will need to follow a different process. – thshea Jun 08 '21 at 15:08
  • This looks interesting - but my program needs to run on Windows as well. I need a cross platform solution – etayluz Jun 08 '21 at 15:12
3

You can use a shareable list: one Python program runs which loads the file and keeps it in memory, and another Python program can then take the data from memory. Whatever your data is, you can load it into a dictionary, dump it as JSON, and then reload the JSON. So:

Program1

import pickle
import json
from multiprocessing.managers import SharedMemoryManager
YOUR_DATA = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
data_dict = {'DATA': YOUR_DATA}
data_dict_json = json.dumps(data_dict)

smm = SharedMemoryManager()
smm.start()
sl = smm.ShareableList(['alpha', 'beta', data_dict_json])
print(sl)
# smm.shutdown()  # keep the manager running for now; call this when you are done

The output will look like this

#OUTPUT
>>>ShareableList(['alpha', 'beta', "your data in json format"], name='psm_12abcd')

Now in Program2:

from multiprocessing import shared_memory
load_from_mem=shared_memory.ShareableList(name='psm_12abcd')
load_from_mem[1]
#OUTPUT
'beta'
load_from_mem[2]
#OUTPUT
'your data as a JSON string'
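
To get the dictionary back in Program2, parse the JSON string; a small sketch using the same variable names as above:

import json

data_dict = json.loads(load_from_mem[2])  # parse the JSON string held in shared memory
YOUR_DATA = data_dict['DATA']             # the original data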


You can read more here: https://docs.python.org/3/library/multiprocessing.shared_memory.html

ibadia
  • 909
  • 6
  • 15
  • 1
    Are you sure this scales? I'd expect the `Manager` code to be pickling and sending over IPC the same data the questioner needs to be efficiently available, so having it pre-loaded in one program may not add anything. – Blckknght Jun 13 '21 at 01:22
  • 1
    It's preloaded in memory. The questioner currently has to load the data from disk every time he runs the program; with this approach the data will be loaded in memory and a reference will be given for another program to load that data. He needs something which takes the file from memory, and this snippet achieves that purpose. It will work for 1 GB of data given that he has enough memory left after OS processes – ibadia Jun 13 '21 at 01:34
  • `File "/Users/etayluz/stocks/src/data_loader.py", line 19, in main sl = smm.ShareableList(['alpha', 'beta', data_dict_json]) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/managers.py", line 1363, in ShareableList sl = shared_memory.ShareableList(sequence) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/shared_memory.py", line 308, in __init__ assert sum(len(fmt) <= 8 for fmt in _formats) == self._list_len AssertionError` – etayluz Jun 14 '21 at 21:29
  • @ibadia any idea what this error is about? – etayluz Jun 14 '21 at 21:32
3

Adding another assumption-challenging answer: it could be where you're reading your files from that makes a big difference.

1 GB is not a great amount of data on today's systems; at 20 seconds to load, that's only 50 MB/s, which is a fraction of what even the slowest disks provide.

You may find you actually have a slow disk or some type of network share as your real bottleneck, and that changing to a faster storage medium or compressing the data (perhaps with gzip) makes a great difference to reading and writing.
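
A quick way to check is to time the raw read separately from the deserialization; a small sketch, reusing DATA_ROOT and pickle_name from the question:

import pickle
import time

path = DATA_ROOT + pickle_name   # as in the question

t0 = time.perf_counter()
with open(path, 'rb') as f:
    raw = f.read()               # raw disk read
t1 = time.perf_counter()
obj = pickle.loads(raw)          # deserialization only
t2 = time.perf_counter()

print(f'raw read: {t1 - t0:.2f}s, unpickle: {t2 - t1:.2f}s')

If the read dominates, faster storage or compression will help; if the unpickling dominates, a different serialization format is the better lever.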

ti7
  • 16,375
  • 6
  • 40
  • 68
  • Thank you for the comment. I'm running locally on a 2018 MacBook Pro. No issues like that here. – etayluz Jun 14 '21 at 20:38
2

As I understood it:

  • something needs to be loaded
  • it needs to be loaded often, because the file with the code that uses this something is edited often
  • you don't want to wait for it to load every time

Maybe such a solution will be okay for you.

You can write the loader script in the following way (tested on Python 3.8):

import importlib.util, traceback, sys, gc

# Example data
import pickle
something = pickle.loads(pickle.dumps([123]))

if __name__ == '__main__':
    try:
        mod_path = sys.argv[1]
    except IndexError:
        print('Usage: python3', sys.argv[0], 'PATH_TO_SCRIPT')
        exit(1)

    modules_before = list(sys.modules.keys())
    argv = sys.argv[1:]
    while True:
        MOD_NAME = '__main__'
        spec = importlib.util.spec_from_file_location(MOD_NAME, mod_path)
        mod = importlib.util.module_from_spec(spec)

        # Change to needed global name in the target module
        mod.something = something
        
        sys.modules[MOD_NAME] = mod
        sys.argv = argv
        try:
            spec.loader.exec_module(mod)
        except:
            traceback.print_exc()
        del mod, spec
        modules_after = list(sys.modules.keys())
        for k in modules_after:
            if k not in modules_before:
                del sys.modules[k]
        gc.collect()
        print('Press enter to re-run, CTRL-C to exit')
        sys.stdin.readline()

Example of module:

# While the loader script is running, change 1 to some different number and press enter
something[0] += 1 
print(something)

This should work, and it should reduce the pickle reload time to nearly zero.

UPD: Added the ability to accept a script name and command-line arguments.

2

Here are my assumptions while writing this answer:

  1. Your financial data is produced after complex operations and you want the result to persist in memory
  2. The code that consumes it must be able to access that data fast
  3. You wish to use shared memory

Here is the code (self-explanatory, I believe):

Data structure (class_defs.py)

'''
Nested class definitions to simulate complex data
'''

class A:
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def get_attr(self):
        return self.name, self.value

    def set_attr(self, n, v):
        self.name = n
        self.value = v


class B(A):
    def __init__(self, name, value, status):
        super(B, self).__init__(name, value)
        self.status = status

    def set_attr(self, n, v, s):
        A.set_attr(self, n,v)
        self.status = s

    def get_attr(self):
        print('\nName : {}\nValue : {}\nStatus : {}'.format(self.name, self.value, self.status))

Producer.py

from multiprocessing import shared_memory as sm
import time
import pickle as pkl
import pickletools as ptool
import sys
from class_defs import B


def main():

    # Data Creation/Processing
    obj1 = B('Sam Reagon', '2703', 'Active')
    #print(sys.getsizeof(obj1))
    obj1.set_attr('Ronald Reagon', '1023', 'INACTIVE')
    obj1.get_attr()

    ###### real deal #########

    # Create pickle string
    byte_str = pkl.dumps(obj=obj1, protocol=pkl.HIGHEST_PROTOCOL, buffer_callback=None)
    
    # compress the pickle
    #byte_str_opt = ptool.optimize(byte_str)
    byte_str_opt = bytearray(byte_str)
    
    # place data on shared memory buffer
    shm_a = sm.SharedMemory(name='datashare', create=True, size=len(byte_str_opt))#sys.getsizeof(obj1))
    buffer = shm_a.buf
    buffer[:] = byte_str_opt[:]

    #print(shm_a.name)               # the string to access the shared memory
    #print(len(shm_a.buf[:]))

    # Just an infinite loop to keep the producer running, like a server
    #   a better approach would be to explore use of shared memory manager
    while(True):
        time.sleep(60)


if __name__ == '__main__':
    main()

Consumer.py

from multiprocessing import shared_memory as sm
import pickle as pkl
from class_defs import B    # we need this so that while unpickling, the object structure is understood


def main():
    shm_b = sm.SharedMemory(name='datashare')
    byte_str = bytes(shm_b.buf[:])              # convert the shared_memory buffer to a bytes array

    obj = pkl.loads(byte_str)                   # un-pickle the bytes array (as a data source)

    print(obj.name, obj.value, obj.status)      # get the values of the object attributes


if __name__ == '__main__':
    main()

The code above names the shared memory block 'datashare' on both sides, so no identifier needs to be copied between the scripts (if you uncomment the print(shm_a.name) line, Producer.py will print the block's name, e.g. wnsm_86cd09d4, which is how you would pass it around manually). Just run Producer.py in one terminal and Consumer.py in another terminal on the same machine.
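
When you are finished, it is worth releasing the block explicitly; a small cleanup sketch (close() and unlink() are the standard SharedMemory calls, and skipping them can lead to the resource_tracker warnings about leaked shared_memory objects mentioned in the comments):

shm_b.close()     # in Consumer.py: release this process's view of the buffer
shm_a.close()     # in Producer.py: release its view when shutting down
shm_a.unlink()    # in Producer.py only: destroy the underlying shared memory block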

I hope this is what you wanted!

anurag
  • 1,715
  • 1
  • 8
  • 28
  • This was tested on Python 3.8 (via anaconda 4.8.4) under a Windows 10 x64 environment – anurag Jun 14 '21 at 14:31
  • Traceback (most recent call last): File "/Users/etayluz/stocks/src/data_loader.py", line 18, in byte_str_opt = ptool.optimize(byte_str) File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/pickletools.py", line 2337, in optimize for opcode, arg, pos, end_pos in _genops(p, yield_end_pos=True): File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/pickletools.py", line 2279, in _genops code = data.read(1) AttributeError: 'NoneType' object has no attribute 'read' – etayluz Jun 14 '21 at 21:18
  • would you know what the above error is about? Something with `ptool` – etayluz Jun 14 '21 at 21:41
  • try with that statement removed. Also, try to print the length of the output of the `pkl.dumps` statement - I am guessing it is empty ( _from_ `AttributeError: 'NoneType' object ...` ) – anurag Jun 15 '21 at 06:51
  • Yes - that was my mistake I apologize. – etayluz Jun 15 '21 at 15:14
  • Can you please show how to remove the ptool line? I don't understand how to factor out byte_str_opt and just work with byte_str. ptool takes about a min and I'm not sure I need to do that. – etayluz Jun 15 '21 at 15:15
  • I'm also receiving this error. I am on Python 3.9.5 (latest). `shm_a = smm.SharedMemory(create=True, size=len(byte_str_opt))#sys.getsizeof(obj1)) TypeError: SharedMemory() got an unexpected keyword argument 'create'` – etayluz Jun 15 '21 at 15:15
  • @etayluz are you running your producer and consumer on same machine or different machines? The idea to use `pickletools.optimize` was to shorten the data string before placing it on shared mem. To remove it just comment it out and replace `byte_str_opt` with `byte_str` in the next statement! – anurag Jun 15 '21 at 15:43
  • Thanks @anurag - what about the create error above? How do I resolve this? – etayluz Jun 15 '21 at 16:41
  • @etayluz, I am pretty much certain you are mixing **@ibadia's** answer and mine. `smm` variable comes from his answer - he is using `SharedMemoryManager`. Although I haven't tried, but, this is what is throwing the error. Try to use my code entirely to see if it works! – anurag Jun 15 '21 at 18:47
  • you are right - I mixed up the two answers. – etayluz Jun 15 '21 at 20:06
  • Now I'm seeing this error: `Traceback (most recent call last): File "/Users/etayluz/stocks/src/data_loader.py", line 52, in main() File "/Users/etayluz/stocks/src/data_loader.py", line 42, in main buffer[:] = byte_str[:] ValueError: memoryview assignment: lvalue and rvalue have different structures >/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '` – etayluz Jun 15 '21 at 20:07
  • Is there a way you can share your code? I am unable to recreate the issue. I have simplified the code and would request you to try it from a clean slate! – anurag Jun 16 '21 at 09:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/233853/discussion-between-etayluz-and-anurag). – etayluz Jun 16 '21 at 18:30
  • This approach is clever, but like the ramdisk approach it suffers from the fact that the simulation program has to unpickle the data on every run, which has serious performance implications for large pickles. That would not be an issue if the problem were not easily avoided, but it is relatively easy to avoid it, and doesn't even required managing resources like shared memory as the OS can do that for us automatically. See my answer for how: https://stackoverflow.com/a/68009831/5946921 – Will Da Silva Jun 17 '21 at 13:33
2

You can take advantage of multiprocessing to run the simulations inside of subprocesses, and leverage the copy-on-write benefits of forking to unpickle/process the data only once at the start:

import multiprocessing
import pickle


# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork')


# Load data once, in the parent process
data = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))


def _run_simulation(_):
    # Wrapper for `run_simulation` that takes one argument. The function passed
    # into `multiprocessing.Pool.map` must take one argument.
    run_simulation()


with mp.Pool() as pool:
    pool.map(_run_simulation, range(num_simulations))

If you want to parameterize each simulation run, you can do so like this:

import multiprocessing
import pickle


# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork')


# Load data once, in the parent process
data = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))


with mp.Pool() as pool:
    simulations = ('arg for simulation run', 'arg for another simulation run')
    pool.map(run_simulation, simulations)

This way the run_simulation function will be passed the values from the simulations tuple, which allows each simulation to run with different parameters, or even just assigns each run an ID number or name for logging/saving purposes.

This whole approach relies on fork being available. For more information about using fork with Python's built-in multiprocessing library, see the docs about contexts and start methods. You may also want to consider using the forkserver multiprocessing context (by using mp = multiprocessing.get_context('forkserver')) for the reasons described in the docs.


If you don't want to run your simulations in parallel, this approach can be adapted for that. The key thing is that in order to only have to process the data once, you must call run_simulation within the process that processed the data, or one of its child processes.

If, for instance, you wanted to edit what run_simulation does, and then run it again at your command, you could do it with code resembling this:

main.py:

import multiprocessing
from multiprocessing.connection import Connection
import pickle

from data import load_data


# Load/process data in the parent process
load_data()
# Now child processes can access the data nearly instantaneously


# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork') # Consider using 'forkserver' instead


# This is only ever run in child processes
def load_and_run_simulation(result_pipe: Connection) -> None:
    # Import `run_simulation` here to allow it to change between runs
    from simulation import run_simulation
    # Ensure that simulation has not been imported in the parent process, as if
    # so, it will be available in the child process just like the data!
    try:
        run_simulation()
    except Exception as ex:
        # Send the exception to the parent process
        result_pipe.send(ex)
    else:
        # Send this because the parent is waiting for a response
        result_pipe.send(None)


def run_simulation_in_child_process() -> None:
    result_pipe_output, result_pipe_input = mp.Pipe(duplex=False)
    proc = mp.Process(
        target=load_and_run_simulation,
        args=(result_pipe_input,)
    )
    print('Starting simulation')
    proc.start()
    try:
        # The `recv` below will wait until the child process sends something, or
        # will raise `EOFError` if the child process crashes suddenly without
        # sending an exception (e.g. if a segfault occurs)
        result = result_pipe_output.recv()
        if isinstance(result, Exception):
            raise result # raise exceptions from the child process
        proc.join()
    except KeyboardInterrupt:
        print("Caught 'KeyboardInterrupt'; terminating simulation")
        proc.terminate()
    print('Simulation finished')


if __name__ == '__main__':
    while True:
        choice = input('\n'.join((
            'What would you like to do?',
            '1) Run simulation',
            '2) Exit\n',
        )))
        if choice.strip() == '1':
            run_simulation_in_child_process()
        elif choice.strip() == '2':
            exit()
        else:
            print(f'Invalid option: {choice!r}')

data.py:

import pickle
from functools import lru_cache

# <obtain 'DATA_ROOT' and 'pickle_name' here>


@lru_cache
def load_data():
    with open(DATA_ROOT + pickle_name, 'rb') as f:
        return pickle.load(f)

simulation.py:

from data import load_data


# This call will complete almost instantaneously if `main.py` has been run
data = load_data()


def run_simulation():
    # Run the simulation using the data, which will already be loaded if this
    # is run from `main.py`.
    # Anything printed here will appear in the output of the parent process.
    # Exceptions raised here will be caught/handled by the parent process.
    ...

The three files detailed above should all be within the same directory, alongside an __init__.py file that can be empty. The main.py file can be renamed to whatever you'd like, and is the primary entry-point for this program. You can run simulation.py directly, but that will result in a long time spent loading/processing the data, which was the problem you ran into initially. While main.py is running, the file simulation.py can be edited, as it is reloaded every time you run the simulation from main.py.

For macOS users: forking on macOS can be a bit buggy, which is why Python defaults to using the spawn method for multiprocessing on macOS, but still supports fork and forkserver for it. If you're running into crashes or multiprocessing-related issues, try adding OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES to your environment. See https://stackoverflow.com/a/52230415/5946921 for more details.

Will Da Silva
  • 6,386
  • 2
  • 27
  • 52
  • @etayluz I have edited my answer to add an approach that I believe more closely matches your use-case. Please let me know if you have any questions about this, or if there is anything I can do to help. – Will Da Silva Jun 16 '21 at 21:56
  • Thanks for this! Unfortunately, I don't think it will work because I need to restart after each file edit with this approach. And if I have to restart I have to reload data. – etayluz Jun 17 '21 at 01:30
  • @etayluz No, you don't. See the approach at the bottom of my answer. The file containing `run_simulation` is re-imported every time. You can edit that file, and then enter "1" at the prompt to re-run it. If the previous run is still running, you can enter "ctrl+c" to stop it, and then choose "1" at the prompt. – Will Da Silva Jun 17 '21 at 01:34
  • Thanks! Please see my question - I have already tried this technique and it works weird for a program with lots of files. Some modules get reloaded but others don't. It's not a dependable or scalable technique in my experience. At this point I'm leaning more towards a Producer->Consumer shared memory paradigm. – etayluz Jun 17 '21 at 01:38
  • @etayluz This is not the same approach you've tried. I made sure of that. This specifically only imports the simulation code inside of child processes that haven't already imported it. Thus a process never reloads an already loaded module, which solves those problems you ran into. Let's take this to a chat room: https://chat.stackoverflow.com/rooms/233862/room-for-will-da-silva-and-etayluz – Will Da Silva Jun 17 '21 at 01:40
  • 1
    I see what you're saying now! Thanks for clarifying that. Let me try this tomorrow (it's late here) - and get back to you on this. Thank you! – etayluz Jun 17 '21 at 02:17
  • Now what happens if there is an exception in the child process? How does the parent process catch and handle that exception? – etayluz Jun 17 '21 at 02:19
  • @etayluz I have updated my answer. It is now more detailed, and hopefully more useful for you. It should be noted that I have used an approach like this in production code, so I am confident it can work for you. If exceptions are raised in the child process, the child process will send the exception to the parent process, and then terminate. Within the parent process you can handle the exception however you like. This comment chain is getting long, so we should continue this to the chat room: https://chat.stackoverflow.com/rooms/233862/room-for-will-da-silva-and-etayluz – Will Da Silva Jun 17 '21 at 02:56
  • I don't know why, but none of my edits are being reflected from one run to the next :( I need to quit and restart each time. It works beautifully otherwise. I've tried editing 20 different files - none of them work. – etayluz Jun 17 '21 at 15:13
  • @etayluz I would be happy to help in the chat: https://chat.stackoverflow.com/rooms/233862/room-for-will-da-silva-and-etayluz – Will Da Silva Jun 17 '21 at 15:19
  • Thanks Will! I need a little time to refactor my program - it might be some caveat in my program that's causing the new edits to fail. – etayluz Jun 17 '21 at 15:36
  • As I guess, a check for Windows is needed, and `fork` needs to be replaced with `spawn` in the case of Windows. – CPPCPPCPPCPPCPPCPPCPPCPPCPPCPP Jun 19 '21 at 08:56
  • @CPPCPPCPPCPPCPPCPPCPPCPPCPPCPP It can run on Windows using Cygwin, or the Windows Linux subsystem, or by simply running `simulation.py` directly. In the case of running `simulation.py` directly, it will be slow as the data will need to be unpickled each time, but this seems to be a necessary compromise. We cannot solve the need to edit the simulation code between runs without using sub processes, and we need COW memory (via forking) to unpickled only once. – Will Da Silva Jun 19 '21 at 11:32
0

This is not an exact answer to the question, since the question looks like pickle and SHM are required, but others went off that path, so I am going to share a trick of mine. It might help you. There are some fine solutions here using pickle and SHM anyway; regarding those I can only offer more of the same. Same pasta with slight sauce modifications.

Two tricks I employ when dealing with situations like yours are as follows.

The first is to use sqlite3 instead of pickle. You can even easily develop a module as a drop-in replacement using sqlite. The nice thing is that data will be inserted and selected using native Python types, and you can define your own types with converter and adapter functions that use a serialization method of your choice to store complex objects. That can be pickle or JSON or whatever.

What I do is define a class with data passed in through *args and/or **kwargs of the constructor. It represents whatever object model I need; then I pick up rows from "select * from table;" of my database and let Python unwrap the data during the new object's initialization. Loading a big amount of data with datatype conversions, even custom ones, is surprisingly fast. sqlite will manage buffering and IO for you and do it faster than pickle. The trick is to construct your object to be filled and initialized as fast as possible. I either subclass dict() or use __slots__ to speed things up. sqlite3 comes with Python, so that's a bonus too.
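
A minimal sketch of that idea, assuming the data is a list of dicts as in the question (the records table and payload column are made-up names, and each record is simply stored as a JSON blob):

import json
import sqlite3

def save_records(db_path, records):
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE IF NOT EXISTS records (payload TEXT)')
    con.executemany('INSERT INTO records (payload) VALUES (?)',
                    ((json.dumps(r),) for r in records))
    con.commit()
    con.close()

def load_records(db_path):
    con = sqlite3.connect(db_path)
    rows = con.execute('SELECT payload FROM records')
    records = [json.loads(payload) for (payload,) in rows]
    con.close()
    return records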

The other method of mine is to use a ZIP file and the struct module. You construct a ZIP file with multiple files inside. E.g. for a pronunciation dictionary with more than 400,000 words I'd like a dict() object. So I use one file, say lengths.dat, in which I define the length of a key and the length of a value for each pair in binary format. Then I have one file of words and one file of pronunciations, all one after the other. When I load from the file, I read the lengths and use them to construct a dict() of words with their pronunciations from the two other files. Indexing bytes() is fast, so creating such a dictionary is very fast. You can even have it compressed if disk space is a concern, but some speed loss is introduced then.
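
A rough sketch of that layout for a str-to-str mapping (the file names lengths.dat, keys.dat and values.dat are illustrative):

import struct
import zipfile

def save_mapping(zip_path, mapping):
    lengths, keys, values = bytearray(), bytearray(), bytearray()
    for k, v in mapping.items():
        kb, vb = k.encode('utf-8'), v.encode('utf-8')
        lengths += struct.pack('<II', len(kb), len(vb))  # two 4-byte lengths per pair
        keys += kb
        values += vb
    with zipfile.ZipFile(zip_path, 'w') as z:
        z.writestr('lengths.dat', bytes(lengths))
        z.writestr('keys.dat', bytes(keys))
        z.writestr('values.dat', bytes(values))

def load_mapping(zip_path):
    with zipfile.ZipFile(zip_path) as z:
        lengths = z.read('lengths.dat')
        keys = z.read('keys.dat')
        values = z.read('values.dat')
    mapping, ki, vi = {}, 0, 0
    for klen, vlen in struct.iter_unpack('<II', lengths):
        mapping[keys[ki:ki + klen].decode('utf-8')] = values[vi:vi + vlen].decode('utf-8')
        ki += klen
        vi += vlen
    return mapping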

Both methods will take less space on disk than the pickle would. The second method requires you to read all the data you need into RAM; then you construct the objects, which takes almost double the RAM that the raw data took, and then you can discard the raw data, of course. But altogether it shouldn't require more than the pickle takes. As for RAM, the OS will manage almost anything using virtual memory/swap if needed.

Oh yeah, there is a third trick I use. When I have a ZIP file constructed as mentioned above, or anything else which requires additional deserialization while constructing an object, and the number of such objects is great, then I introduce lazy loading. I.e., say we have a big file with serialized objects in it. You make the program load all the data and distribute it per object, which you keep in a list() or dict(). You write your classes in such a way that when an object is first asked for its data, it unpacks its raw data, deserializes it and so on, removes the raw data from RAM, then returns the result. So you will not be losing loading time until you actually need the data in question, which is much less noticeable to a user than 20 seconds for a process to start.
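
A small sketch of that lazy-load trick, with a made-up Record class whose raw payload happens to be JSON:

import json

class Record:
    __slots__ = ('_raw', '_data')

    def __init__(self, raw_bytes):
        self._raw = raw_bytes   # cheap to keep around
        self._data = None       # deserialized on first access

    @property
    def data(self):
        if self._data is None:
            self._data = json.loads(self._raw)  # deserialize lazily
            self._raw = None                    # drop the raw bytes
        return self._data

# e.g. records = [Record(chunk) for chunk in raw_chunks]; the expensive
# json.loads only happens the first time records[i].data is actually used.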

Dalen
  • 4,128
  • 1
  • 17
  • 35
0

I implemented the python-preloaded script, which can help you here. It will store the CPython state at an early stage after some modules are loaded, and then when you need it, you can restore from this state and load your normal Python script. Storing currently means that it will stay in memory, and restoring means that it does a fork on it, which is very fast. But these are implementation details of python-preloaded and should not matter to you.

So, to make it work for your use case:

  • Make a new module, data_preloaded.py or so, and in there, just this code:

    preloaded_data = load_pickle(...)
    
  • Now run py-preloaded-bundle-fork-server.py data_preloaded -o python-data-preloaded.bin. This will create python-data-preloaded.bin, which can be used as a replacement for python.

  • I assume you have started python your_script.py before. So now run ./python-data-preloaded.bin your_script.py. Or also just python-data-preloaded.bin (no args). The first time, this will still be slow, i.e. take about 20 seconds. But now it is in memory.

  • Now run ./python-data-preloaded.bin your_script.py again. Now it should be extremely fast, i.e. a few milliseconds. And you can start it again and again and it will always be fast, until you restart your computer.

Albert
  • 65,406
  • 61
  • 242
  • 386