I have three different Thrust-based implementations that perform the same calculation: the first is the slowest and requires the least GPU memory, the second is the fastest and requires the most GPU memory, and the third is in between. For each of them I know the size and data type of every device vector used, so I am using vector.size()*sizeof(type) to roughly estimate the memory needed for storage.
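For reference, this is essentially all my storage estimate amounts to (a minimal sketch; the helper name is mine, and it ignores allocator rounding and any capacity() beyond size()):

```cpp
#include <thrust/device_vector.h>
#include <cstddef>

// Rough storage footprint of one device_vector: element count times element size.
template <typename T>
std::size_t storage_bytes(const thrust::device_vector<T>& v)
{
    return v.size() * sizeof(T);
}
```

In practice I just sum this over every device vector a given implementation would allocate.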
So, for a given input, I would like to decide based on its size which implementation to use; in other words, determine the fastest implementation that will fit in the available GPU memory.
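Concretely, the selection I have in mind looks something like this (a sketch only; the per-implementation estimate functions and the 10% headroom are placeholders I would tune for my actual code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

enum class Impl { Fast, Medium, Slow };

// Hypothetical per-implementation storage estimates as a function of input size n.
std::size_t fast_impl_bytes(std::size_t n)   { return 6 * n * sizeof(float); }
std::size_t medium_impl_bytes(std::size_t n) { return 4 * n * sizeof(float); }
std::size_t slow_impl_bytes(std::size_t n)   { return 2 * n * sizeof(float); }

// Pick the fastest implementation whose estimated footprint fits in free device memory.
Impl choose_impl(std::size_t n)
{
    std::size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    // Leave some headroom for algorithm temporaries, fragmentation, etc.
    const std::size_t budget = free_bytes - free_bytes / 10;

    if (fast_impl_bytes(n) <= budget)   return Impl::Fast;
    if (medium_impl_bytes(n) <= budget) return Impl::Medium;
    return Impl::Slow;
}
```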
I think that for the very long vectors I am dealing with, the storage size I am calculating for vector.data() is a fairly good estimate, and the rest of the overhead (if any) can be disregarded.
But how would I estimate the memory overhead (if any) associated with the Thrust algorithm implementations? Specifically, I am looking for such estimates for transform, copy, reduce, reduce_by_key, and gather. I do not really care about overhead that is static and not a function of the algorithm's input and output sizes, unless it is very significant.
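If there is no documented formula, one thing I considered is measuring the temporary allocations empirically by routing them through a custom allocator. A sketch of what I mean, assuming the allocator-taking overload of thrust::cuda::par can be used this way and with a counting_allocator of my own invention:

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Allocator that forwards to cudaMalloc/cudaFree and records how many bytes
// the algorithm requested for temporary storage.
struct counting_allocator
{
    typedef char value_type;
    std::size_t requested = 0;

    char* allocate(std::ptrdiff_t num_bytes)
    {
        requested += static_cast<std::size_t>(num_bytes);
        char* p = nullptr;
        cudaMalloc(&p, num_bytes);
        return p;
    }

    void deallocate(char* p, std::size_t) { cudaFree(p); }
};

int main()
{
    thrust::device_vector<float> v(1 << 24, 1.0f);

    counting_allocator alloc;
    // Route the algorithm's temporary allocations through the counting allocator.
    float sum = thrust::reduce(thrust::cuda::par(alloc), v.begin(), v.end());

    std::printf("reduce result %f, temporary bytes requested: %zu\n",
                sum, alloc.requested);
    return 0;
}
```

But I would prefer an analytical estimate over benchmarking every algorithm for every input size, if such estimates exist.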
I understand the implications of GPU memory fragmentation, etc., but let's leave that aside for a moment.
Thank you very much for taking the time to look into this.