I'm using the CUDA/Thrust library to do some Monte Carlo simulations. This works very well up to a certain number of simulations, where I get a bad_alloc exception. That seems fine, because an increasing number of simulations in my code means handling increasingly large device_vectors, so I expect this kind of exception to show up at some point.
What I'd like to do now is set an upper limit on the number of simulations based on the available memory on my GPU, so that I can split the workload into batches of simulations.
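Roughly, the batching I have in mind looks like the sketch below (just an illustration; bytesPerSimulation is a placeholder for whatever per-simulation estimate I end up with, and runBatch stands for my actual simulation code):

#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Sketch: estimate how many simulations fit in the currently free device
// memory and process the total workload in chunks of that size.
size_t maxSimulationsPerBatch(size_t bytesPerSimulation, double safetyFactor = 0.8)
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    size_t usable = static_cast<size_t>(freeMem * safetyFactor);  // keep some headroom
    return std::max<size_t>(1, usable / bytesPerSimulation);
}

void runAllSimulations(size_t nTotal, size_t bytesPerSimulation)
{
    size_t batchSize = maxSimulationsPerBatch(bytesPerSimulation);
    for (size_t start = 0; start < nTotal; start += batchSize)
    {
        size_t count = std::min(batchSize, nTotal - start);
        // runBatch(start, count);  // launch the Monte Carlo simulations for this chunk
    }
}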
So I've been trying to size my problem before launching my set of simulations. Unfortunately, when I try to understand how the memory is managed with simple examples, I get surprising results.
Here is an example of code I have been testing:
#include <cuda.h>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_profiler_api.h>
#include <iostream>

int main()
{
    size_t freeMem, totalMem;

    cudaDeviceReset();
    cudaSetDevice(0);

    // Baseline: free memory before any vector is allocated
    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << "Total Memory | Free Memory" << std::endl;
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1k(1000, 0);       // 1,000 floats
    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec100k(100000, 0);   // 100,000 floats
    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1M(1000000, 0);    // 1,000,000 floats
    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    return 0;
}
And here are the results I get:
Total Memory | Free Memory
2147483648, 2080542720
2147483648, 2079494144
2147483648, 2078445568
2147483648, 2074382336
So, basically,
- the 1,000-element vector (plus everything else needed) uses 1,048,576 bytes
- the 100,000-element vector also uses 1,048,576 bytes!
- the 1,000,000-element vector uses 4,063,232 bytes.
I would have expected memory usage to scale roughly with the number of elements, but I get a "4x" where I expected a "10x" (between 100,000 and 1,000,000 elements), and the relationship does not hold at all between 1,000 and 100,000 elements.
So, my 2 questions are:
- Can anyone help me understand those numbers?
- If I can't estimate the proper amount of memory my code will use, then what would be a good strategy to make sure my program fits in memory? (The crude fallback I have in mind is sketched below.)
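To be concrete about the second question, the only fallback I can think of is trial and error: try to allocate the working buffer for a batch, catch the bad_alloc, and retry with a smaller batch, along the lines of this sketch (findWorkableBatch and floatsPerSimulation are placeholder names):

#include <thrust/device_vector.h>
#include <new>
#include <cstddef>

// Crude fallback sketch: halve the requested batch size until the device
// allocation succeeds, and return the batch size that worked.
std::size_t findWorkableBatch(std::size_t requested, std::size_t floatsPerSimulation)
{
    std::size_t batch = requested;
    while (batch > 0)
    {
        try
        {
            // Try to allocate the working buffer for 'batch' simulations.
            thrust::device_vector<float> work(batch * floatsPerSimulation);
            return batch;        // allocation succeeded
        }
        catch (const std::bad_alloc&)
        {
            batch /= 2;          // too big: retry with half the batch
        }
    }
    return 0;                    // even a single simulation does not fit
}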
Edit
Following Mai Longdong's comment, I tried with two vectors, one of 262144 floats (4 bytes each) and the other of 262145. Unfortunately, things don't look like a straightforward "1 MB page allocation":
- size of the 1st vector (262144 floats): 1048576 bytes
- size of the 2nd vector (262145 floats): 1179648 bytes
The delta between the two is 131072 bytes (or 128 KB). Would the page size be variable? Does this make sense?
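For reference, the test I ran was essentially the same measurement as above, just with these two vectors:

#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <iostream>

int main()
{
    size_t freeMem, totalMem;

    cudaMemGetInfo(&freeMem, &totalMem);
    size_t before = freeMem;

    thrust::device_vector<float> vecA(262144, 0);    // 262144 * 4 bytes = exactly 1 MB of payload
    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << "1st vector uses " << before - freeMem << " bytes" << std::endl;

    size_t afterA = freeMem;
    thrust::device_vector<float> vecB(262145, 0);    // one float more than 1 MB
    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << "2nd vector uses " << afterA - freeMem << " bytes" << std::endl;

    return 0;
}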