I'm not familiar with the dispatching library/pattern you're using, but I've had a quick glance over what it aims to do. I've done a fair amount of work in the image processing/video processing domain, so hopefully my answer isn't a completely useless wall-of-text ;)
My suspicion is that you're firing off whole image buffers to different threads to run the same processing on them. If so, you're quickly going to hit RAM limits: if each task (thread) uses N image buffers in its internal functions, and your RAM is M, then you may start running out of headroom at around M / N concurrent tasks (threads). If that happens, you may need to resort to firing off chunks of images to the threads instead (see the hints further down on using dependency graphs for processing).
You should also consider the possibility that performance in your particular algorithm is memory bound and not CPU bound. So it may be pointless firing off more threads even though you have extra cores, and perhaps in this case you're better off focusing on CPU SIMD things like SSE/MMX.
Profile first, Ask (Memory Allocator) Questions Later
Using hand-rolled memory allocators that cater for concurrent environments and your specific memory requirements can make a big difference to performance. However, they're unlikely to reduce the amount of memory you use unless you're working with many small objects, where you may be able to do a better job with memory layout when allocating and reclaiming them than the default malloc/free implementations. As you're working with image processing algorithms, that scenario is unlikely: you've typically got huge image buffers allocated on the heap as opposed to many small-ish structs.
I'll add a few tips on where to begin reading on rolling your own allocators at the end of my answer, but in general my advice would be to first profile and figure out where the memory is being used. Having written the code you may have a good hunch about where it's going already, but if not tools like valgrind's massif (complicated beast) can be a big help.
After having profiled the code, figure out how you can reduce the memory use. There are many, many things you can do here, depending on what's using the memory. For example:
- Free up any memory you don't need as soon as you're done with it. RAII can come in handy here.
- Don't copy memory unless you need to.
- Share memory between threads and processes where appropriate. It will be more difficult than working with immutable/copied data, because you'll have to synchronise read/write access, but depending on your problem case it may make a big difference.
- If you're using memory caches, and you don't want to cache the data to disk for performance reasons, then consider using in-memory compression (e.g. zipping entries as they fall towards the bottom of your least-recently-used cache).
- Instead of loading a whole dataset and having each method operate on the whole of it, see if you can chunk it up and only operate on a subset of it. This is particularly relevant when dealing with large data sets.
- See if you can get away with using less resolution or accuracy, e.g. quarter-size instead of full size images, or 32 bit floats instead of 64 bit floats (or even custom libraries for 16 bit floats), or perhaps using only one channel of image data at a time (just red, or just blue, or just green, or greyscale instead of RGB).
As you're working with OpenCV, I'm guessing you're either working on image processing or video processing. These can easily gobble up masses of memory. In my experience, initial R&D implementations typically process a whole image buffer in one method before passing it over to the next. This often results in multiple full image buffers being used, which is hugely expensive in terms of memory consumption. Reducing the use of any temporary buffers can be a big win here.
Another approach to alleviate this is to figure out the data dependencies (e.g. by looking at the ROIs required by your low-pass filters), then process smaller chunks of the images and join them up again later, avoiding temporary duplicate buffers as much as possible. Reducing the memory footprint in this way can be a big win, as you're also typically reducing the chance of cache misses. Such approaches often hugely complicate the implementation, and unless you have a graph-based framework in place that already supports it, it's probably not something you should attempt before exhausting other options. Intel have a number of great resources on optimising threaded image processing applications.
Tips on Memory Allocators
If you still think playing with memory allocators is going to be useful, here are some tips.
For example, on Linux, you could use
- malloc hooks, or
- just override them in your main compilation unit (main.cpp), or a library that you statically link, or a shared library that you LD_PRELOAD, for example.
There are several excellent malloc/free replacements available that you could study for ideas, e.g. dlmalloc, tcmalloc, and jemalloc.
If you're dealing with specific C++ objects, then you can override their new and delete operators at the class level, rather than replacing allocation globally.
Lastly, if I did manage to guess wrong regarding where memory is being used, and you do, in fact, have loads of small objects, then search the web for 'small object allocators'. Andrei Alexandrescu wrote some great material on this, for example the small-object allocator chapter in Modern C++ Design.