I am currently writing a custom PyTorch dataloader that loads the training data for machine learning from a 2GB JSON file.
The dataloader, which is basically a CPython extension module written in C++, loads the entire JSON file into a temporary data structure and converts the data into another in-memory format which is compatible with the PyTorch model I'm using.
I've managed to make the program load and convert the data at a reasonable speed thanks to the brilliant free libraries out there, but it turned out that the program consumes too much memory when I tried to scale up the training.
When PyTorch performs multi-GPU/multi-node training, the library allocates one Python process for each GPU, which means there are several separate instances of my dataloader running at the same time.
My machine has enough RAM for one dataloader to run without problems, but not enough RAM to run several of them.
I confirmed that once RAM is exhausted, the dataloaders start using several GB of swap space, which degrades performance severely.
So I started to see where I could save some RAM space in my dataloader.
I found that the temporary data structure into which the JSON data is initially loaded is no longer needed once the conversion is finished, so I want to release that memory for the other processes.
The question is, how am I supposed to do this with the standard library? The data structure basically consists of std::vectors and std::unordered_maps on the heap, but just destructing them does not free up the memory space because there is no heap compaction mechanism implemented in Linux glibc.
On Windows I could implement a custom std::allocator that resides in a separate heap and just destroy the entire heap after use (though I'm not sure this actually works), but glibc malloc() does not take a heap handle parameter.
I doubt this is the first time this question has been asked, and I also doubt that implementing a custom std::allocator based on a third-party heap allocator is the only answer. How can I release heap memory back to the OS so that other processes can use it, on Linux with glibc? Could you give me some pointers?
Thanks in advance.