You already have a good sense of the options available to you.
The best solution depends on your actual needs,
so I will try to cover the possibilities with working examples.
What you want
If I understand your requirements correctly, you want to
- continuously update a DataFrame (from a websocket)
- while doing some computations on the same DataFrame
- while keeping the DataFrame up to date for the computation workers,
- where one computation is CPU-intensive
- and another is not.
What you need
As you said, you will need a way to run different threads or processes so that the computations keep running while the data keeps arriving.
How about Threads
The easiest way to achieve what you want is to use the threading library.
Since threads can share memory, and only one worker actually updates the DataFrame, it is quite easy to keep the data up to date for all of them:
import time
from dataclasses import dataclass, field
from threading import Thread

import pandas


@dataclass
class DataFrameHolder:
    """This dataclass holds a reference to the current DF in memory.

    This is necessary if you do operations without in-place modification of
    the DataFrame, since you will need to replace the whole object.
    """
    # default_factory avoids sharing one mutable default DataFrame between instances
    dataframe: pandas.DataFrame = field(
        default_factory=lambda: pandas.DataFrame(columns=['A', 'B'])
    )

    def update(self, data):
        # DataFrame.append was removed in pandas 2.0, so we concatenate instead
        self.dataframe = pandas.concat(
            [self.dataframe, pandas.DataFrame(data)], ignore_index=True
        )


class StreamLoader:
    """This class is our worker communicating with the websocket"""

    def __init__(self, df_holder: DataFrameHolder) -> None:
        super().__init__()
        self.df_holder = df_holder

    def update_df(self):
        # read from the websocket and update your DF.
        data = {
            'A': [1, 2, 3],
            'B': [4, 5, 6],
        }
        self.df_holder.update(data)

    def run(self):
        # limit the loop for the showcase
        for _ in range(5):
            self.update_df()
            print("[1] Updated DF %s" % str(self.df_holder.dataframe))
            time.sleep(3)


class LightComputation:
    """This class is a random computation worker"""

    def __init__(self, df_holder: DataFrameHolder) -> None:
        super().__init__()
        self.df_holder = df_holder

    def compute(self):
        print("[2] Current DF %s" % str(self.df_holder.dataframe))

    def run(self):
        # limit the loop for the showcase
        for _ in range(5):
            self.compute()
            time.sleep(5)


def main():
    # We create a DataFrameHolder to keep our DataFrame available.
    df_holder = DataFrameHolder()

    # We create and start our update worker
    stream = StreamLoader(df_holder)
    stream_process = Thread(target=stream.run)
    stream_process.start()

    # We create and start our computation worker
    compute = LightComputation(df_holder)
    compute_process = Thread(target=compute.run)
    compute_process.start()

    # We join our Threads, i.e. we wait for them to finish before continuing
    stream_process.join()
    compute_process.join()


if __name__ == "__main__":
    main()
Note that we use a class to hold a reference to the current DataFrame, because operations such as concatenation are not done in place:
they produce a new object, so if we handed the DataFrame reference directly to the workers, they would keep pointing at the old data and the updates would be lost to them.
The DataFrameHolder object, on the other hand, keeps the same location in memory, so every worker can always reach the updated DataFrame through it.
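To make the problem concrete, here is a tiny sketch (not part of the solution above, just an illustration) of what happens when a worker receives the DataFrame itself instead of a holder:

import pandas

df = pandas.DataFrame(columns=['A', 'B'])

def computation_worker(frame: pandas.DataFrame):
    # Rebinding the local name creates a new object; the caller's `df`
    # still points to the original, empty DataFrame.
    frame = pandas.concat(
        [frame, pandas.DataFrame({'A': [1], 'B': [2]})], ignore_index=True
    )

computation_worker(df)
print(len(df))  # prints 0: the update is invisible to the caller

Wrapping the DataFrame in a holder object and always reading it through that object avoids this pitfall.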
Processes may be more powerful
Now, if you need more computation power, processes may be more useful, since they bypass the GIL and let you isolate a worker on a different CPU core.
To start a Process instead of a Thread in Python, you can use the multiprocessing library.
The API of both objects is the same, so you only have to change the constructor as follows:
from threading import Thread
# I create a thread
compute_process = Thread(target=compute.run)
from multiprocessing import Process
# I create a process that I can use the same way
compute_process = Process(target=compute.run)
Now, if you make that change in the script above, you will see that the DataFrame is no longer updated correctly on the computation side.
This requires more work, because the two processes do not share memory, and there are multiple ways for them to communicate (https://en.wikipedia.org/wiki/Inter-process_communication).
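For example, a multiprocessing.Queue is one of the simplest of those mechanisms: the updating process pushes each chunk of data into the queue and the computing process rebuilds its own copy of the data from it. This is only a sketch of that idea, not the approach used below:

from multiprocessing import Process, Queue


def producer(queue: Queue) -> None:
    # Push raw updates through the queue instead of sharing the DataFrame itself.
    queue.put({'A': [1, 2, 3], 'B': [4, 5, 6]})
    queue.put(None)  # sentinel: no more updates


def consumer(queue: Queue) -> None:
    while True:
        data = queue.get()
        if data is None:
            break
        print("received update:", data)


if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=producer, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

The drawback is that every consumer has to maintain its own copy of the DataFrame from the updates it receives, which is why a shared manager object is more convenient for your case.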
The reference from @SimonCrane is quite interesting on the matter and showcases the use of shared memory between two processes using a multiprocessing manager.
Here is a version where the computation worker runs in a separate process and the DataFrame is shared through such a manager:
import logging
import multiprocessing
import time
from dataclasses import dataclass, field
from multiprocessing import Process
from multiprocessing.managers import BaseManager
from threading import Thread

import pandas


@dataclass
class DataFrameHolder:
    """This dataclass holds a reference to the current DF in memory.

    This is necessary if you do operations without in-place modification of
    the DataFrame, since you will need to replace the whole object.
    """
    # default_factory avoids sharing one mutable default DataFrame between instances
    dataframe: pandas.DataFrame = field(
        default_factory=lambda: pandas.DataFrame(columns=['A', 'B'])
    )

    def update(self, data):
        # DataFrame.append was removed in pandas 2.0, so we concatenate instead
        self.dataframe = pandas.concat(
            [self.dataframe, pandas.DataFrame(data)], ignore_index=True
        )

    def retrieve(self):
        return self.dataframe


class DataFrameManager(BaseManager):
    """This manager handles the shared DataFrameHolder.

    See https://docs.python.org/3/library/multiprocessing.html#examples
    """
    # You can also use a socket file '/tmp/shared_df'
    MANAGER_ADDRESS = ('localhost', 6000)
    MANAGER_AUTH = b"auth"

    def __init__(self) -> None:
        super().__init__(self.MANAGER_ADDRESS, self.MANAGER_AUTH)

    @classmethod
    def register_dataframe(cls):
        cls.register("DataFrameHolder", DataFrameHolder)


class DFWorker:
    """Abstract class initializing a worker depending on a DataFrameHolder."""

    def __init__(self, df_holder: DataFrameHolder) -> None:
        super().__init__()
        self.df_holder = df_holder


class StreamLoader(DFWorker):
    """This class is our worker communicating with the websocket"""

    def update_df(self):
        # read from the websocket and update your DF.
        data = {
            'A': [1, 2, 3],
            'B': [4, 5, 6],
        }
        self.df_holder.update(data)

    def run(self):
        # limit the loop for the showcase
        for _ in range(4):
            self.update_df()
            print("[1] Updated DF\n%s" % str(self.df_holder.retrieve()))
            time.sleep(3)


class LightComputation(DFWorker):
    """This class is a random computation worker"""

    def compute(self):
        print("[2] Current DF\n%s" % str(self.df_holder.retrieve()))

    def run(self):
        # limit the loop for the showcase
        for _ in range(4):
            self.compute()
            time.sleep(5)


def main():
    logger = multiprocessing.log_to_stderr()
    logger.setLevel(logging.INFO)

    # Register our DataFrameHolder type in the DataFrameManager.
    DataFrameManager.register_dataframe()
    manager = DataFrameManager()
    manager.start()

    # We create a managed DataFrameHolder to keep our DataFrame available.
    df_holder = manager.DataFrameHolder()

    # We create and start our update worker
    stream = StreamLoader(df_holder)
    stream_process = Thread(target=stream.run)
    stream_process.start()

    # We create and start our computation worker
    compute = LightComputation(df_holder)
    compute_process = Process(target=compute.run)
    compute_process.start()

    # The managed dataframe is updated in every Thread/Process
    time.sleep(5)
    print("[0] Main process DF\n%s" % df_holder.retrieve())

    # We join our Threads, i.e. we wait for them to finish before continuing
    stream_process.join()
    compute_process.join()

    # Properly stop the manager's server process once we are done
    manager.shutdown()


if __name__ == "__main__":
    main()
As you can see, the differences between the threading and multiprocessing versions are quite small.
With a few tweaks, you can start from there and connect to the same manager from a different file if you want to handle your CPU-intensive processing there.
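If you go down that road, the second script does not start its own manager: it connects to the one that is already running and asks it for the existing holder. The sketch below only shows the client side; it assumes the server script additionally registers an accessor returning its holder (the get_holder name here is hypothetical), following the "Using a remote manager" example in the multiprocessing documentation:

# cpu_worker.py: hypothetical separate script for the CPU-intensive part
from multiprocessing.managers import BaseManager


class RemoteDataFrameManager(BaseManager):
    """Client-side manager: nothing is registered with a callable here."""


if __name__ == "__main__":
    # Register only the name: the callable itself lives on the server side.
    RemoteDataFrameManager.register("get_holder")
    manager = RemoteDataFrameManager(address=('localhost', 6000), authkey=b"auth")
    manager.connect()  # connect() instead of start(): the server is already running
    df_holder = manager.get_holder()  # proxy to the holder living in the manager
    print(df_holder.retrieve())

The address and auth key obviously have to match the MANAGER_ADDRESS and MANAGER_AUTH used above.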