
I have 3 different Thrust-based implementations that perform certain calculations: the first is the slowest and requires the least GPU memory, the second is the fastest and requires the most GPU memory, and the third one is in between. For each of those I know the size and data type of each device vector used, so I am using vector.size()*sizeof(type) to roughly estimate the memory needed for storage.
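
For example, a per-implementation estimate along those lines might look like this (the element types here are just placeholders for whatever your implementations actually store):

    #include <cstddef>

    // Rough storage estimate for one implementation, following the
    // vector.size() * sizeof(type) approach: e.g. one device_vector<int>
    // of keys and two device_vector<float>s of values/results, all of length n.
    std::size_t storage_bytes(std::size_t n)
    {
        return n * sizeof(int) + 2 * n * sizeof(float);
    }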

So for a given input, based on its size, I would like to decide which implementation to use. In other words, I want to determine the fastest implementation that will fit in the available GPU memory.

I think that for the very long vectors I am dealing with, the storage size I am calculating this way is a fairly good estimate, and the rest of the overhead (if any) can be disregarded.

But how would I estimate the memory usage overhead (if any) associated with the implementation of the Thrust algorithms? Specifically, I am looking for such estimates for transform, copy, reduce, reduce_by_key, and gather. I do not really care about overhead that is static and is not a function of the sizes of the algorithm's input and output parameters, unless it is very significant.

I understand the implications of GPU memory fragmentation, etc., but let's leave that aside for a moment.

Thank you very much for taking the time to look into this.

Leo
  • talonmies, thank you for your answer. I should have mentioned that I am using Thrust 1.5.2, which came with the CUDA 4.2 installation. My understanding is that Thrust's "custom temporary allocation" feature requires 1.6. Is that correct? At one point I tried to swap out 1.5.2 for 1.6, but after I did that I got so many errors reported by the compiler that I had to switch back, because at the time I just couldn't afford to spend any time trying to fix them. It appeared that some host/device vector constructors were no longer available, or something along those lines. – Leo Jun 11 '12 at 16:52

1 Answer


Thrust is intended to be used like a black box, and there is no documentation of the memory overheads of the various algorithms that I am aware of. But it doesn't sound like a very difficult problem to deduce empirically by running a few numerical experiments. You might expect the memory consumption of a particular algorithm to be approximable as:

total number of words of memory consumed = a + (1 + b)*N

for a problem with N input words. Here a is the fixed overhead of the algorithm and 1 + b is the slope of the best-fit line of memory consumed versus N. b is then the amount of overhead the algorithm incurs per input word.
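
For example, if two runs (numbers invented purely for illustration) measured M1 = 1,200,000 words consumed at N1 = 1,000,000 input words and M2 = 2,400,000 words at N2 = 2,000,000 input words, the two unknowns follow directly:

    b = (M2 - M1) / (N2 - N1) - 1 = (2,400,000 - 1,200,000) / (2,000,000 - 1,000,000) - 1 = 0.2
    a = M1 - (1 + b) * N1 = 1,200,000 - 1.2 * 1,000,000 = 0

i.e. that algorithm would need roughly 20% extra memory on top of its input, with no significant fixed overhead. With more than two problem sizes, a least-squares fit of the same line is more robust.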

So the question then becomes how to monitor the memory usage of a given algorithm. Thrust uses an internal helper function, get_temporary_buffer, to allocate internal memory. The best idea would be to write your own implementation of get_temporary_buffer which emits the size it has been called with, and (perhaps) uses a call to cudaMemGetInfo to get context memory statistics at the time the function gets called. You can see some concrete examples of how to intercept get_temporary_buffer calls here.
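
A minimal instrumented sketch along those lines, modelled on Thrust's custom_temporary_allocation example (this relies on the tag-based dispatch introduced in Thrust 1.6, so it will not compile against 1.5.2; my_tag and the logging are purely illustrative):

    #include <cstddef>
    #include <iostream>
    #include <cuda_runtime.h>

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/pair.h>
    #include <thrust/memory.h>
    #include <thrust/iterator/retag.h>
    #include <thrust/system/cuda/memory.h>

    // Custom tag derived from the CUDA backend tag. Algorithms invoked on
    // iterators retagged with my_tag dispatch their temporary allocations
    // to the two overloads below.
    struct my_tag : thrust::system::cuda::tag {};

    template<typename T>
    thrust::pair<thrust::pointer<T, my_tag>, std::ptrdiff_t>
    get_temporary_buffer(my_tag, std::ptrdiff_t n)
    {
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        std::cout << "temporary buffer request: " << n * sizeof(T)
                  << " bytes (" << free_bytes << " bytes currently free)\n";

        T* raw = 0;
        cudaMalloc(&raw, n * sizeof(T));
        return thrust::make_pair(thrust::pointer<T, my_tag>(raw), n);
    }

    template<typename Pointer>
    void return_temporary_buffer(my_tag, Pointer p)
    {
        cudaFree(thrust::raw_pointer_cast(p));
    }

    int main()
    {
        const std::size_t N = 1 << 24;
        thrust::device_vector<int> data(N, 1);

        // Retag the iterators so temporary allocations made inside the
        // algorithm are routed through the instrumented overloads above.
        int sum = thrust::reduce(thrust::retag<my_tag>(data.begin()),
                                 thrust::retag<my_tag>(data.end()));
        std::cout << "sum = " << sum << std::endl;
        return 0;
    }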

With a suitably instrumented allocator and some runs at a few different problem sizes, you should be able to fit the model above and estimate the b value for a given algorithm. The model can then be used in your code to determine safe maximum problem sizes for a given amount of memory.
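
For example (the struct, the 10% safety margin, and the run_* functions below are placeholders, not anything from your code), the fitted constants could then drive a run-time choice against the free memory reported by cudaMemGetInfo:

    #include <cstddef>
    #include <cuda_runtime.h>

    // Fitted model for one implementation: words_needed(N) = a + (1 + b) * N.
    // Fill in a and b from your own experiments.
    struct MemoryModel { double a; double b; };

    bool fits_in_free_memory(const MemoryModel& m, std::size_t n, std::size_t word_size)
    {
        std::size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);

        double bytes_needed = (m.a + (1.0 + m.b) * double(n)) * double(word_size);
        return bytes_needed < 0.9 * double(free_bytes);   // keep a 10% safety margin
    }

    // Usage sketch: try the fastest (most memory hungry) implementation first.
    // MemoryModel fast_model = {...}, medium_model = {...}, slow_model = {...};
    // if      (fits_in_free_memory(fast_model, n, sizeof(float)))   run_fast(...);
    // else if (fits_in_free_memory(medium_model, n, sizeof(float))) run_medium(...);
    // else                                                          run_slow(...);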

I hope this is what you were asking about...

talonmies