
The Problem

I have written a neural network classifier that takes in massive images (~1-3 GB apiece), splits them into patches, and passes the patches through the network individually. Training was going very slowly, so I benchmarked it and found that it takes ~50 s to load the patches from one image into memory (using the OpenSlide library), but only ~0.5 s to pass them through the model.
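
For reference, the patch-loading benchmark was roughly the following (a simplified sketch; the path, level, and coordinates are placeholders, and the real loop reads many more regions):

import time
import openslide  # OpenSlide Python bindings

slide = openslide.OpenSlide("path/to/normal_042.tif")  # placeholder path
coords = [(14336, 10752), (9408, 18368), (8064, 25536), (16128, 14336)]

start = time.time()
# read_region returns an RGBA PIL image for each patch
patches = [slide.read_region(xy, 2, (224, 224)) for xy in coords]
print("Loaded {} patches in {:.2f} s".format(len(patches), time.time() - start))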

However, I'm working on a supercomputer with 1.5 TB of RAM, of which only ~26 GB is being used. The dataset totals ~500 GB. My thinking is that if we could load the entire dataset into memory, it would speed up training tremendously. But I am working with a research team, and we run experiments across multiple Python scripts. So ideally, I would like to load the entire dataset into memory in one script and be able to access it from all of the scripts.

More details:

  • We run our individual experiments in separate Docker containers (on the same machine), so the dataset has to be accessible across multiple containers.
  • The dataset is the Camelyon16 Dataset; images are stored in .tif format.
  • We just need to read the images, no need to write.
  • We only need to access small portions of the dataset at a time.

Possible Solutions

I have found many posts about how to share Python objects or raw data in memory across multiple Python scripts:

Sharing Python data across scripts

Server Processes with SyncManager and BaseManager in the multiprocessing module | Example 1 | Example 2 | Docs - Server Processes | Docs - SyncManagers

  • Positives: Can be shared by processes on different computers over a network (can it be shared by multiple containers?)
  • Possible issue: slower than using shared memory, according to the docs. If we share memory across multiple containers using a client/server, will that be any faster than all of the scripts reading from disk?
  • Possible issue: according to this answer, the Manager object pickles objects before sending them, which could slow things down.

mmap module | Docs

  • Possible issue: mmap maps the file into virtual memory, not physical memory; the data remains backed by the file on disk rather than being loaded into RAM.
  • Possible issue: because we only use a small portion of the dataset at a time, and virtual memory keeps the entire dataset on disk, we may run into thrashing and the program slows to a crawl (see the sketch just after this list).
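
For illustration, the usage I have in mind looks roughly like this (a sketch only; the path is a placeholder):

import mmap

# Placeholder path; map the whole file read-only without reading it up front
with open("path/to/normal_042.tif", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[1024:1024 + 4096]  # only the pages backing this slice are faulted into RAM
    mm.close()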

Pyro4 (client-server for Python objects) | Docs

The sysv_ipc module for Python. This demo looks promising.

  • Possible issue: this may just be a lower-level exposure of functionality already available in the built-in multiprocessing module.

I also found this list of options for IPC/networking in Python.

Some discuss server-client setups, and some discuss serialization/deserialization, which I'm afraid will take longer than just reading from disk. None of the answers I've found address whether these approaches would actually improve I/O performance.

Sharing memory across Docker containers

Not only do we need to share Python objects/memory across scripts; we need to share them across Docker containers.

The Docker documentation explains the --ipc flag pretty well. What makes sense to me according to the documentation is running:

docker run -d --ipc=shareable data-server
docker run -d --ipc=container:data-server data-client

But when I run my client and server in separate containers with an --ipc connection set up as described above, they are unable to communicate with each other. The SO questions I've read (1, 2, 3, 4) don't address how to integrate shared memory between Python scripts running in separate Docker containers.
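
One idea I have not been able to verify is to write raw bytes into /dev/shm from the server container, since (as I understand it) a shared IPC namespace also shares that tmpfs between the containers. A rough sketch (paths are placeholders; numpy is only used for convenience):

# In the server container (started with --ipc=shareable)
import numpy as np

data = np.fromfile("path/to/normal_042.tif", dtype=np.uint8)  # placeholder path
data.tofile("/dev/shm/patches.bin")  # /dev/shm is the tmpfs tied to the IPC namespace

# In the client container (started with --ipc=container:data-server)
patches = np.memmap("/dev/shm/patches.bin", dtype=np.uint8, mode="r")
print(patches[:16])  # should be served from shared memory, not disk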

My Questions:

  • 1: Would any of these provide faster access than reading from disk? Is it even reasonable to think that sharing data in memory across processes/containers would improve performance?
  • 2: Which would be the most appropriate solution for sharing data in memory across multiple Docker containers?
  • 3: How can memory-sharing solutions in Python be integrated with docker run --ipc=<mode>? (Is a shared IPC namespace even the best way to share memory across Docker containers?)
  • 4: Is there a better solution than these for fixing our problem of large I/O overhead?

Minimal Working Example - Updated. Requires no external dependencies!

This is my naive approach to memory sharing between Python scripts in separate containers. It works when the Python scripts are run in the same container, but not when they are run in separate containers.

server.py

from multiprocessing.managers import SyncManager
import multiprocessing

patch_dict = {}

image_level = 2
image_files = ['path/to/normal_042.tif']
region_list = [(14336, 10752),
               (9408, 18368),
               (8064, 25536),
               (16128, 14336)]

def load_patch_dict():

    for i, image_file in enumerate(image_files):
        # We would load the image files here. As a placeholder, we just add `1` to the dict
        patches = 1
        patch_dict.update({'image_{}'.format(i): patches})

def get_patch_dict():
    return patch_dict

class MyManager(SyncManager):
    pass

if __name__ == "__main__":
    load_patch_dict()
    port_num = 4343
    MyManager.register("patch_dict", get_patch_dict)
    manager = MyManager(("127.0.0.1", port_num), authkey=b"password")
    # Set the authkey because it doesn't set properly when we initialize MyManager
    multiprocessing.current_process().authkey = b"password"
    manager.start()
    input("Press any key to kill server".center(50, "-"))
    manager.shutdown()

client.py

from multiprocessing.managers import SyncManager
import multiprocessing
import sys, time

class MyManager(SyncManager):
    pass

MyManager.register("patch_dict")

if __name__ == "__main__":
    port_num = 4343

    manager = MyManager(("127.0.0.1", port_num), authkey=b"password")
    multiprocessing.current_process().authkey = b"password"
    manager.connect()
    patch_dict = manager.patch_dict()

    keys = list(patch_dict.keys())
    for key in keys:
        image_patches = patch_dict.get(key)
        # Do NN stuff (irrelevant)

These scripts work fine for sharing the images when the scripts are run in the same container. But when they are run in separate containers, like this:

# Run the container for the server
docker run -it --name cancer-1 --rm --cpus=10 --ipc=shareable cancer-env
# Run the container for the client
docker run -it --name cancer-2 --rm --cpus=10 --ipc=container:cancer-1 cancer-env

I get the following error:

Traceback (most recent call last):
  File "patch_client.py", line 22, in <module>
    manager.connect()
  File "/usr/lib/python3.5/multiprocessing/managers.py", line 455, in connect
    conn = Client(self._address, authkey=self._authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
  • I suspect the issue with your containerized setup is that your docker containers live in different networks and cannot talk to each other via `127.0.0.1`. You can try to start them with `--network host` maybe that helps. – swenzel Jul 09 '19 at 14:53
  • Thanks for the comment - That helped, it got me farther. Rather than getting a `ConnectionRefusedError` on `manager.connect()` in `client.py`, the program makes it to `image_patches = patch_dict.get(key)` but raises [this error](https://pastebin.com/x7vn9sTJ). – Jacob Stern Jul 09 '19 at 15:42
  • @JacobStern, you are using the network and not `ipc` here. Instead of using `--ipc=container:cancer-1` use `--network=container:cancer-1` and then try – Tarun Lalwani Jul 10 '19 at 04:14
  • That makes sense. So under the hood, server processes communicate using sockets that communicate over networks, requiring network communication between docker containers? – Jacob Stern Jul 10 '19 at 15:56
  • According to [this article](https://dzone.com/articles/docker-in-action-the-shared-memory-namespace) and a couple of others, it sounds like shared memory is the way to go, because network/pipe speeds are not nearly as fast as memory speed. Is that right? – Jacob Stern Jul 10 '19 at 18:02
  • @TarunLalwani I also noticed that docker has several namespaces, including `net` and `ipc`. Do you know which one of these a server process would operate in? – Jacob Stern Jul 10 '19 at 23:06
  • You are creating an HTTP server, so it would be net only; I am not sure what code is needed for IPC though – Tarun Lalwani Jul 11 '19 at 01:11
  • Could also be that they just share the metadata (i.e. memory addresses, field names, field types) via network and do the rest via shared memory. But I haven't tried it out so I don't know either. – swenzel Jul 12 '19 at 11:07

2 Answers


I recommend you try using tmpfs.

It is a Linux feature that lets you create a virtual filesystem stored entirely in RAM. This allows very fast file access and takes as little as one bash command to set up.

In addition to being very fast and straightforward, it has many advantages in your case:

  • No need to touch current code - the structure of the dataset stays the same
  • No extra work to create the shared dataset - just cp the dataset into the tmpfs
  • Generic interface - being a filesystem, you can easily integrate the in-RAM dataset with other components of your system that aren't necessarily written in Python. For example, it is easy to use inside your containers: just pass the mount's directory into them.
  • Will fit other environments - if your code ever has to run on a different server, tmpfs can adapt and swap pages to the hard drive. If you have to run this on a server with no free RAM, you can just keep all your files on the hard drive with a normal filesystem and not touch your code at all.

Steps to use:

  1. Create a tmpfs - sudo mount -t tmpfs -o size=600G tmpfs /mnt/mytmpfs
  2. Copy dataset - cp -r dataset /mnt/mytmpfs
  3. Change all references from the current dataset to the new dataset
  4. Enjoy
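
To check that you actually get a speedup, you can time a raw read from both locations with a few lines of Python (a quick sketch; the paths are placeholders). Keep in mind that Linux's page cache can make the disk read look just as fast if the file was accessed recently.

import time

def time_read(path):
    start = time.time()
    with open(path, "rb") as f:
        f.read()  # read the whole file once
    return time.time() - start

print("disk:  {:.2f} s".format(time_read("/dataset/tumor/my_file.tif")))
print("tmpfs: {:.2f} s".format(time_read("/mnt/mytmpfs/tumor/my_file.tif")))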


Edit:

ramfs might be faster than tmpfs in some cases as it doesn't implement page swapping. To use it just replace tmpfs with ramfs in the instructions above.

  • This is by far the simplest; but docs suggest tmpfs can't be used to share memory between containers: https://docs.docker.com/storage/tmpfs/ - but this is a tmpfs on the host, mounted as a regular volume by the container, correct? – Nino Walker Jul 15 '19 at 13:22
  • Thanks - I tried this as you described it, but I didn't see any speedup. Time to access one file from `/tumor/my_file.tif`: 4.4349s. Time to access one file from `/mnt/mytmpfs/tumor/my_file.tif`: 4.6474s. Any idea why that might be? – Jacob Stern Jul 15 '19 at 21:49
  • @NinoWalker yeah, the tmpfs is on the host – kmaork Jul 16 '19 at 16:56
  • @JacobStern can you compare that to the access times on the host? And what kind of access are we talking about? Did you try just reading the file? Also, tmpfs implements swapping, which might cause disk-like speeds. You could try using ramfs (more low level, without swapping) to rule that out. The commands to use it are the same, just replace tmpfs with ramfs. – kmaork Jul 16 '19 at 17:02
  • 1
    I checked on a local machine, and tmpfs was significantly slower than ramfs, I recommend you try it. If it works I will edit my answer accordingly :) – kmaork Jul 16 '19 at 21:22
  • @kmaork I haven't compared access times inside docker containers yet -- I just compared the times on the host. As far as the type of access, I am reading the files with the `openslide` package, with [this script](https://pastebin.com/7Rn1dm3D). I hosted an example image [here](https://www.dropbox.com/s/717x3pv2j8mtws2/normal_042.tif?dl=0) so you can replicate the script without downloading the entire dataset. I didn't see any better results with ramfs. Anything I might be missing? – Jacob Stern Jul 16 '19 at 22:00
  • It may be that the latency is happening in the `openslide` internals rather than in reading from disk. But I doubt it. Would you mind sharing your script that demonstrates that loading images from ramfs is faster? – Jacob Stern Jul 16 '19 at 22:03
  • Well, in my local machine I just created a ramfs, copied a large file into it, and measured the time it took to perform `cat file > /dev/null` when the file was on ramfs, tmpfs and on the disk. – kmaork Jul 16 '19 at 22:37
  • Another explanation for this behavior, is the linux file cache. Linux caches accessed files on the RAM, and it might be that in your experiment you were accessing a file on the disk that was already cached and therefore the access time was similar to the ramfs one – kmaork Jul 16 '19 at 22:43
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/196545/discussion-between-jacob-stern-and-kmaork). – Jacob Stern Jul 16 '19 at 22:57

I think a shared-memory or mmap-based solution would be appropriate.

shared memory:

First, read the dataset into memory in a server process. In Python, use the multiprocessing wrappers to create objects in memory shared between processes, such as multiprocessing.Value or multiprocessing.Array, then create Process instances and pass the shared object to them as an argument.
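
A minimal sketch of that idea (the path is a placeholder; note this shares memory between processes spawned from one parent, not across containers):

import multiprocessing
import numpy as np

def worker(shared_arr):
    # View the shared buffer as a NumPy array without copying
    data = np.frombuffer(shared_arr.get_obj(), dtype=np.uint8)
    print(data[:16])

if __name__ == "__main__":
    raw = np.fromfile("path/to/normal_042.tif", dtype=np.uint8)   # placeholder path
    shared_arr = multiprocessing.Array("B", raw.nbytes)           # unsigned bytes in shared memory
    np.frombuffer(shared_arr.get_obj(), dtype=np.uint8)[:] = raw  # copy the dataset in once
    p = multiprocessing.Process(target=worker, args=(shared_arr,))
    p.start()
    p.join()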

mmap:

Store the dataset in a file on the host, then mount that file into each container. If one container opens the file and maps it into its virtual memory, the other containers will not need to read it from disk when they open it, because the file is already in physical memory (the page cache).
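
Roughly, each container would run something like this against the same bind-mounted file (a sketch; the mount path and offsets are placeholders):

import mmap

def read_chunk(path, offset, size):
    # Pages loaded by one container stay in the host page cache,
    # so other containers mapping the same file avoid disk reads.
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk = mm[offset:offset + size]
        mm.close()
    return chunk

data = read_chunk("/data/normal_042.tif", 1024, 4096)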

P.S. I am not sure how CPython implements large shared memory between processes; it probably uses mmap internally.
