
I have 3 different Thrust-based implementations that perform certain calculations: the first is the slowest and requires the least GPU memory, the second is the fastest and requires the most GPU memory, and the third one is in between. For each of those I know the size and data type of each device vector used, so I am using vector.size()*sizeof(type) to roughly estimate the memory needed for storage.
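
For example, a per-implementation estimate along those lines might look like this (the element types here are just placeholders for whatever your implementations actually store):

    #include <cstddef>

    // Rough storage estimate for one implementation, following the
    // vector.size() * sizeof(type) approach: e.g. one device_vector<int>
    // of keys and two device_vector<float>s of values/results, all of length n.
    std::size_t storage_bytes(std::size_t n)
    {
        return n * sizeof(int) + 2 * n * sizeof(float);
    }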

So for a given input, based on its size, I would like to decide which implementation to use. In other words, I want to determine the fastest implementation that will fit in the available GPU memory.

I think that for the very long vectors I am dealing with, the storage size I am calculating this way is a fairly good estimate, and the rest of the overhead (if any) can be disregarded.

But how would I estimate the memory usage overhead (if any) associated with the implementation of the Thrust algorithms? Specifically, I am looking for such estimates for transform, copy, reduce, reduce_by_key, and gather. I do not really care about overhead that is static and is not a function of the sizes of the algorithm's input and output parameters, unless it is very significant.

I understand the implications of GPU memory fragmentation, etc., but let's leave that aside for a moment.

Thank you very much for taking the time to look into this.

Leo
  • talonmies, thank you for your answer. I should have mentioned that I am using Thrust 1.5.2, which came with the CUDA 4.2 installation. My understanding is that Thrust's "custom temporary allocation" feature requires 1.6. Is that correct? At one point I tried to swap out 1.5.2 for 1.6, but after I did that I got so many errors reported by the compiler that I had to switch back, because at the time I just couldn't afford to spend any time trying to fix them. It appeared that some host/device vector constructors were no longer available, or something along those lines. – Leo Jun 11 '12 at 16:52

1 Answer


Thrust is intended to be used like a black box, and there is no documentation of the memory overheads of the various algorithms that I am aware of. But it doesn't sound like a very difficult problem to deduce empirically by running a few numerical experiments. You might expect the memory consumption of a particular algorithm to be approximable as:

total number of words of memory consumed = a + (1 + b)*N

for a problem with N input words. Here a is the fixed overhead of the algorithm and 1 + b is the slope of the best-fit line of memory consumed versus N. b is then the amount of overhead the algorithm incurs per input word.
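
For example, if two runs (numbers invented purely for illustration) measured M1 = 1,200,000 words consumed at N1 = 1,000,000 input words and M2 = 2,400,000 words at N2 = 2,000,000 input words, the two unknowns follow directly:

    b = (M2 - M1) / (N2 - N1) - 1 = (2,400,000 - 1,200,000) / (2,000,000 - 1,000,000) - 1 = 0.2
    a = M1 - (1 + b) * N1 = 1,200,000 - 1.2 * 1,000,000 = 0

i.e. that algorithm would need roughly 20% extra memory on top of its input, with no significant fixed overhead. With more than two problem sizes, a least-squares fit of the same line is more robust.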

So the question then becomes how to monitor the memory usage of a given algorithm. Thrust uses an internal helper function, get_temporary_buffer, to allocate internal memory. The best idea would be to write your own implementation of get_temporary_buffer which emits the size it has been called with, and (perhaps) uses a call to cudaMemGetInfo to get context memory statistics at the time the function gets called. You can see some concrete examples of how to intercept get_temporary_buffer calls here.
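
A minimal instrumented sketch along those lines, modelled on Thrust's custom_temporary_allocation example (this relies on the tag-based dispatch introduced in Thrust 1.6, so it will not compile against 1.5.2; my_tag and the logging are purely illustrative):

    #include <cstddef>
    #include <iostream>
    #include <cuda_runtime.h>

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/pair.h>
    #include <thrust/memory.h>
    #include <thrust/iterator/retag.h>
    #include <thrust/system/cuda/memory.h>

    // Custom tag derived from the CUDA backend tag. Algorithms invoked on
    // iterators retagged with my_tag dispatch their temporary allocations
    // to the two overloads below.
    struct my_tag : thrust::system::cuda::tag {};

    template<typename T>
    thrust::pair<thrust::pointer<T, my_tag>, std::ptrdiff_t>
    get_temporary_buffer(my_tag, std::ptrdiff_t n)
    {
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        std::cout << "temporary buffer request: " << n * sizeof(T)
                  << " bytes (" << free_bytes << " bytes currently free)\n";

        T* raw = 0;
        cudaMalloc(&raw, n * sizeof(T));
        return thrust::make_pair(thrust::pointer<T, my_tag>(raw), n);
    }

    template<typename Pointer>
    void return_temporary_buffer(my_tag, Pointer p)
    {
        cudaFree(thrust::raw_pointer_cast(p));
    }

    int main()
    {
        const std::size_t N = 1 << 24;
        thrust::device_vector<int> data(N, 1);

        // Retag the iterators so temporary allocations made inside the
        // algorithm are routed through the instrumented overloads above.
        int sum = thrust::reduce(thrust::retag<my_tag>(data.begin()),
                                 thrust::retag<my_tag>(data.end()));
        std::cout << "sum = " << sum << std::endl;
        return 0;
    }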

With a suitably instrumented allocator and some runs at a few different problem sizes, you should be able to fit the model above and estimate the b value for a given algorithm. The model can then be used in your code to determine safe maximum problem sizes for a given amount of memory.
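
For example (the struct, the 10% safety margin, and the run_* functions below are placeholders, not anything from your code), the fitted constants could then drive a run-time choice against the free memory reported by cudaMemGetInfo:

    #include <cstddef>
    #include <cuda_runtime.h>

    // Fitted model for one implementation: words_needed(N) = a + (1 + b) * N.
    // Fill in a and b from your own experiments.
    struct MemoryModel { double a; double b; };

    bool fits_in_free_memory(const MemoryModel& m, std::size_t n, std::size_t word_size)
    {
        std::size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);

        double bytes_needed = (m.a + (1.0 + m.b) * double(n)) * double(word_size);
        return bytes_needed < 0.9 * double(free_bytes);   // keep a 10% safety margin
    }

    // Usage sketch: try the fastest (most memory hungry) implementation first.
    // MemoryModel fast_model = {...}, medium_model = {...}, slow_model = {...};
    // if      (fits_in_free_memory(fast_model, n, sizeof(float)))   run_fast(...);
    // else if (fits_in_free_memory(medium_model, n, sizeof(float))) run_medium(...);
    // else                                                          run_slow(...);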

I hope this is what you were asking about...

talonmies