
I use TensorFlow, but I'm writing documentation for users who typically work across a range of deep learning frameworks.

When working with datasets that don't fit on the local filesystem (TB+), I sample data from a remote data store and write the samples locally in TensorFlow's standard tfrecords format.

During the first epoch of training I will have only sampled a few values, so an epoch of local data is very small; I train on it anyway. On epoch 2 I re-examine which data files have been produced by my sampling subprocesses (now more of them) and train on the expanded set of local data files for the next epoch. I repeat this process each epoch. In this way I build up a local cache of samples and can evict older samples as I fill up the local storage. The local sample cache grows at about the time the model needs the variance the most (towards the latter part of training).
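
Concretely, the per-epoch re-scan looks roughly like this (a sketch rather than my actual code; `cache_dir`, `parse_fn`, `batch_size`, and `num_epochs` are placeholders):

import glob
import tensorflow as tf

for epoch in range(num_epochs):
    # Pick up whatever shards the sampling subprocesses have written so far;
    # the list grows every epoch as the local cache fills.
    shards = sorted(glob.glob(cache_dir + "/*.tfrecords"))
    ds = tf.data.TFRecordDataset(shards, num_parallel_reads=8)
    ds = ds.map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.shuffle(10_000).batch(batch_size).prefetch(1)
    for batch in ds:
        ...  # one training epoch over the expanded local cache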

In Python/TensorFlow it's crucial that I not deserialize the data in the Python training-loop process, because the Python GIL can't support these data transfer rates (300-600 MB/sec of raw, incompressible scientific data), and GPU performance suffers when the GIL can't service the training loop fast enough.

Writing the samples to tfrecords files from subprocesses (Python multiprocessing) allows TensorFlow's native TFRecordDataset to do the deserialization outside of Python, so we sidestep the Python GIL issues and I can saturate a GPU at high I/O data rates.
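
For context, the writer side is roughly this shape (again a sketch; `fetch_sample()` and the single "data" feature are placeholders for my real sampling code):

import tensorflow as tf
from multiprocessing import Process

def writer_worker(shard_path, n_samples):
    # Runs in its own process; serializes samples into a tfrecords shard
    # that the training process will pick up on a later epoch.
    with tf.io.TFRecordWriter(shard_path) as writer:
        for _ in range(n_samples):
            raw = fetch_sample()  # bytes pulled from the remote store (placeholder)
            example = tf.train.Example(features=tf.train.Features(feature={
                "data": tf.train.Feature(bytes_list=tf.train.BytesList(value=[raw])),
            }))
            writer.write(example.SerializeToString())

Process(target=writer_worker, args=("/local/cache/shard_000.tfrecords", 1024)).start()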

I would like to know how I would address this issue in PyTorch. I'm writing about the sampling strategy being used, and want to provide specific recommendations to users of both TensorFlow and PyTorch, but I don't know the PyTorch preprocessing ecosystem well enough to write with sufficient detail.

Side note: the only purely Python-based solution to support these data transfer rates may come in Python 3.8 with multiprocessing shared memory (the new multiprocessing.shared_memory module), but I haven't tried that yet as support for it isn't quite sufficient (soon it will be). Existing multiprocessing solutions aren't sufficient because they require deserialization in the training-loop process and thus lock the GIL during deserialization at high I/O rates.
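
For what it's worth, that untried approach would look roughly like this (`sample`, `sample_shape`, and `sample_nbytes` are placeholders):

import numpy as np
from multiprocessing import shared_memory

# In the producer process: allocate a block and write a raw sample into it.
shm = shared_memory.SharedMemory(create=True, size=sample_nbytes)
src = np.ndarray(sample_shape, dtype=np.float32, buffer=shm.buf)
src[:] = sample  # no pickling, just a copy into shared memory

# In the training process: attach by name (passed over a queue/pipe) and
# view the same memory without deserializing anything.
existing = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(sample_shape, dtype=np.float32, buffer=existing.buf)
# ... hand `view` to the framework; later existing.close() and shm.unlink()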

David Parks
  • How do you know the data transfer rates suffer from the Python GIL? To the best of my knowledge, it's CPU-bound operations that are affected by the GIL in most cases, not I/O-bound operations. –  Feb 17 '20 at 14:05
  • In my testing, just doing deserialization between Python processes at the fastest data rates I can achieve keeps the target process at 100% CPU utilization. I've attempted many approaches: asyncio, multiprocessing, even direct socket reads. In the case of direct socket reads I can get 4GB/sec across processes, but the moment I even try to join binary strings I drop to 2GB/sec, and anything more complex drops me to about 1GB/sec max transfer rate. That's all with the target process fully utilizing the core and thus locking the GIL. – David Parks Feb 17 '20 at 16:49
  • Note that this isn't really an issue with common large datasets like ImageNet, because the I/O needed to move compressed JPEGs on large neural networks is small compared to what uncompressed scientific data training on small networks demands. – David Parks Feb 17 '20 at 16:50
  • A string join is a CPU-bound operation, and it can easily demand 100% of a CPU core without using any of the machine's I/O capacity at all. So it's not evidence that the GIL restricts I/O throughput. –  Feb 17 '20 at 23:41
  • Indeed, that's my point: at high I/O throughput the problem becomes CPU bound, because there are sufficiently many trivial operations (deserialization, basic data manipulation) in the pipeline from raw serialized data to the GPU. This is why I sidestep Python entirely under TensorFlow. As long as all of these operations happen in TensorFlow they are parallelized in C, outside Python. The main process (training loop) is free to consume ~6 cores getting data from its serialized form onto the GPU, but the Python side isn't CPU bound, so the GIL has no negative impact. – David Parks Feb 17 '20 at 23:57
  • Those trivial operations don't claim the main process's GIL if the data are loaded by `DataLoader`, as in my answer. –  Feb 18 '20 at 00:13

1 Answer


Actually, you can easily deserialize data in subprocesses by using `torch.utils.data.DataLoader`. By setting the `num_workers` argument to 1 or a bigger value, you spawn subprocesses with their own Python interpreters and GILs.

loader = torch.utils.data.DataLoader(your_dataset, num_workers=n, **kwargs)
for epoch in range(epochs):
    for batch_idx, data in enumerate(loader):
        # loader in the main process does not claim the GIL at this point;
        # the num_workers subprocesses did the deserialization
        ...  # Do training
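
Note that the loader yields ordinary CPU tensors; the host-to-device copy is done in the main process and is cheap compared to deserialization. A rough sketch (with `pin_memory` so the copy can be asynchronous; `your_dataset`, `n`, and `kwargs` as above):

device = torch.device("cuda")
loader = torch.utils.data.DataLoader(your_dataset, num_workers=n, pin_memory=True, **kwargs)
for data in loader:
    # The workers already deserialized the batch; with pin_memory=True the
    # loader hands it over in pinned memory, so this is just a (potentially
    # asynchronous) copy to the GPU.
    data = data.to(device, non_blocking=True)
    ...  # Do training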

A `DataLoader` requires a `torch.utils.data.Dataset` to get data from. It may not be a trivial job to implement a proper subclass in your case (a rough sketch of one appears after the examples below). If you need to recreate a `Dataset` instance for every epoch, you can do something like this.

for epoch in range(epochs):
    dset = get_new_dataset()
    loader = torch.utils.data.DataLoader(dset, num_workers=n, **kwargs)
    for batch_idx, data in enumerate(loader):
        ...  # Do training

or even better

dset = get_new_dataset()
loader = torch.utils.data.DataLoader(dset, num_workers=n, **kwargs)

for epoch in range(epochs):
    last_batch_idx = (len(dset) - 1) // loader.batch_size
    for batch_idx, data in enumerate(loader):
        # Prepare the next epoch's dataset/loader in advance to avoid blocking;
        # the rebound loader takes effect on the next outer iteration
        if batch_idx == last_batch_idx:
            dset = get_new_dataset()
            loader = torch.utils.data.DataLoader(dset, num_workers=n, **kwargs)
        ...  # Do training
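
For reference, a map-style Dataset over a directory of locally cached sample files might look roughly like this; the `.npy` layout, the `cache_dir` path, and the `CachedSampleDataset` name are just illustrative assumptions about your setup:

import glob
import numpy as np
import torch
from torch.utils.data import Dataset

class CachedSampleDataset(Dataset):
    """Map-style dataset over whatever sample files exist locally right now."""

    def __init__(self, cache_dir):
        # Snapshot the cache contents; recreate the dataset each epoch to
        # pick up files written by the sampling subprocesses since then.
        self.files = sorted(glob.glob(cache_dir + "/*.npy"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # This runs in a DataLoader worker process when num_workers >= 1,
        # so deserialization never touches the main process's GIL.
        sample = np.load(self.files[idx])
        return torch.from_numpy(sample)

def get_new_dataset(cache_dir="/local/cache"):
    return CachedSampleDataset(cache_dir)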

As a side note, please note that it's CPU-bound operations that are affected by the GIL in most cases, not I/O-bound operations; i.e., threading will do for any purely I/O-heavy operation and you don't even need subprocesses. For more information please refer to this question and this wikipedia article.

  • Just to confirm, does `torch.utils.data.DataLoader` place data on the GPU from the subprocesses, or is it trying to use Python's multiprocessing to move it to the training-loop process? I have found that just deserialization from one process to another at data rates approaching 1GB/sec is more than one full core of work, hence the GIL issues I've encountered when trying this approach in TF. But if `torch.utils.data.DataLoader` is moving data onto the GPU in a way that doesn't require Python deserialization, then all is well. I just want to confirm that understanding. – David Parks Feb 17 '20 at 16:55
  • @DavidParks What specific function do you use when you are testing deserialization from one process to another? seems like the deserialization process involves a CPU bound operation, hence the GIL issues. –  Feb 17 '20 at 23:58
  • I've tried multiprocessing (very slow), Pipes (better), and raw socket reads (best). None of these work when the I/O rates are a substantial fraction of a GB/sec; just moving that much data demands more than one core, and hence pure-Python solutions (prior to 3.8's shared memory support) fall apart for me in TensorFlow. That's why I write to tfrecords files and let TensorFlow do the deserialization outside Python. TF doesn't lock the Python GIL and parallelizes the operations, so my main process uses 600% CPU while the Python GIL remains idle and free to service the training loop. – David Parks Feb 18 '20 at 00:02
  • @DavidParks I mean, what kind of deserialization function or library do you use? (not interprocess communication library). `torch.utils.data.DataLoader` can easily utilize 600% CPU or more, and the main process doesn't need much CPU power in most cases when the training is mostly GPU computation (When the training is mostly CPU computation, still no problem because pytorch's matrix operation can easily utilize multiple CPUs). –  Feb 18 '20 at 00:26
  • Just using pickle to deserialize across Python processes, then a Python generator function to feed samples into the TensorFlow ecosystem. That's the approach that fails for me. Once the data is in the TensorFlow ecosystem it's placed on the GPU, and training is another story. TF doesn't provide a way for Python subprocesses to feed data to TF; you only have a few options, and tfrecords-formatted data (the protocol buffers format) is the most logical. It sounds like it may be easier in PyTorch, so I'll have some PyTorch users here validate it. – David Parks Feb 18 '20 at 00:32
  • I would like to clarify. If I write python code in a dataset, and set num_workers to something >1, they will NOT be under the GIL? I found a GitHub project that seems to [claim the opposite](https://github.com/Lyken17/Efficient-PyTorch) – harveyslash Jul 15 '20 at 22:07
  • @harveyslash I went to the page you linked, but I couldn't find any opposite claim. The data-loading issue mentioned there is about I/O time. What the GIL can get in the way of is CPU operations, not I/O operations. https://en.wikipedia.org/wiki/I/O_bound –  Jul 22 '20 at 03:25