
I'm using the CUDA/Thrust library to do some Monte Carlo simulations. This works very well up to a certain number of simulations, at which point I get a bad_alloc exception. That seems reasonable, because an increasing number of simulations in my code means handling increasingly large device_vectors, so I expect this kind of exception to show up at some point.

What I'd like to do now is set an upper limit on the number of simulations based on the available memory on my GPU, and then split the workload into batches of simulations.
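Roughly what I have in mind is something like the sketch below (the per-simulation footprint bytesPerSim and the total workload are placeholders I would still have to estimate properly):

#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <iostream>

int main()
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);

    const size_t bytesPerSim = 4096;      // placeholder: device memory needed per simulation
    const size_t totalSims   = 10000000;  // placeholder: total workload
    const size_t usable      = freeMem - freeMem / 10;  // keep ~10% headroom

    const size_t simsPerBatch = std::max<size_t>(1, usable / bytesPerSim);

    for (size_t done = 0; done < totalSims; done += simsPerBatch)
    {
        size_t n = std::min(simsPerBatch, totalSims - done);
        // runBatch(n);  // hypothetical: run n simulations with device_vectors sized for n
        std::cout << "batch of " << n << " simulations" << std::endl;
    }
    return 0;
}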

So I've been trying to size my problem before launching my set of simulations. Unfortunately, when I'm trying to understand the way the memory is managed with simple examples I get surprising results.

Here is an example of code I have been testing:

#include <iostream>
#include <cuda.h>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_profiler_api.h>

int main() 
{
    size_t freeMem, totalMem;

    cudaDeviceReset();
    cudaSetDevice(0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << "Total Memory | Free Memory "<< std::endl;
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1k(1000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec100k(100000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1M(1000000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    return 0;
}

And here are the results I get:

Total Memory | Free Memory
2147483648, 2080542720
2147483648, 2079494144
2147483648, 2078445568
2147483648, 2074382336

So, basically,

  • the 1,000-element vector (plus everything else needed) uses 1,048,576 bytes
  • the 100,000-element vector also uses 1,048,576 bytes!
  • the 1,000,000-element vector uses 4,063,232 bytes.

I would have expected the memory usage to scale roughly with the number of elements, but going from 100,000 to 1,000,000 elements I see roughly a "4x" where I expected a "10x", and between 1,000 and 100,000 elements the usage does not change at all.

So, my 2 questions are:

  • Can anyone help me understand those numbers?
  • If I can't estimate the proper amount of memory my code will use, then what would be a good strategy to ensure my program fits in memory? (The fallback I'm currently considering is sketched just below.)
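
Here is that fallback, as a minimal sketch: catch the allocation failure and retry with a smaller batch. The helper function and the results vector are placeholders standing in for my real simulation code.

#include <thrust/device_vector.h>
#include <thrust/system_error.h>
#include <iostream>
#include <new>

// Hypothetical helper: find the largest batch of n simulations that still fits,
// halving n whenever the device allocation throws.
size_t largestFittingBatch(size_t n)
{
    while (n > 0)
    {
        try
        {
            thrust::device_vector<float> results(n);  // the allocation that may fail
            // ... run n simulations filling results ...
            return n;
        }
        catch (const std::bad_alloc&)          // the bad_alloc I am currently seeing
        {
            n /= 2;
        }
        catch (const thrust::system_error&)    // other CUDA errors surfaced by Thrust
        {
            n /= 2;
        }
    }
    return 0;
}

int main()
{
    std::cout << largestFittingBatch(size_t(1) << 28) << " simulations fit" << std::endl;
    return 0;
}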

Edit

Following Mai Longdong's comment, I tried with two vectors, one of 262144 floats (4 bytes each) and the other of 262145. Unfortunately, things don't look like a straight "1 MB per page" allocation:

  • memory consumed by the 1st vector (262144 floats): 1,048,576 bytes
  • memory consumed by the 2nd vector (262145 floats): 1,179,648 bytes

The delta between the two is 131,072 bytes (128 KiB). Could the page size be variable? Does that make sense?
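
For what it's worth, the same thing can be checked with raw cudaMalloc, bypassing Thrust entirely. This is just a sketch of that probe, using the two sizes from above:

#include <cuda_runtime.h>
#include <cstddef>
#include <iostream>

// Report how much free memory a single cudaMalloc of count floats consumes.
void probe(size_t count)
{
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);

    float* p = 0;
    cudaMalloc((void**)&p, count * sizeof(float));
    cudaMemGetInfo(&freeAfter, &total);

    std::cout << count << " floats -> " << (freeBefore - freeAfter)
              << " bytes taken from free memory" << std::endl;
    cudaFree(p);
}

int main()
{
    probe(262144);  // exactly 1 MiB of payload
    probe(262145);  // one float more
    return 0;
}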

  • did you read https://devtalk.nvidia.com/default/topic/525397/cudamemgetinfo-returns-wrong-amount-free-memory/ ? – m.s. Aug 10 '15 at 12:44
  • Well, I just did and it was very informative, thanks a lot. It seems that the memory allocation process is far from being linear. – Tikoloche Aug 10 '15 at 13:00
  • Or [this one on Stack Overflow](http://stackoverflow.com/q/10394755/681865) – talonmies Aug 10 '15 at 13:03
  • ok, I think the main idea is that, as long as the device_vector size is below some threshold value and there is plenty of memory available, a rather large default piece of memory is allocated. – Tikoloche Aug 10 '15 at 14:12
  • The CUDA allocator uses pages, like most modern allocators. In this case it seems the page size is 1 MB. An easy way to "confirm" that (it's still unspecified behaviour, but you can observe it) would be to allocate a vector of 262144 elements and one of 262145. The latter should use twice the memory. – user703016 Aug 13 '15 at 06:41
  • Thanks for your comment, I edited my question following your suggestion. – Tikoloche Aug 13 '15 at 07:15
  • Re your edit: any multiple of 64 kB would make sense, actually. I've answered a question in the past where the page size was 64 kB. Anyway, this is not documented. – user703016 Aug 13 '15 at 11:27

1 Answer


Thrust doesn't do anything magical with memory management; the default allocator is just cudaMalloc, and what you are seeing is the driver memory manager's page size selection algorithm at work. This isn't documented, and there is no indication that the behaviour is consistent between platforms or hardware versions.

That said, if I expand your code into something a bit more useful:

#include <iostream>
#include <vector>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

void report_mem(size_t allocd, bool first=false)
{
    size_t freeMem, totalMem;
    cudaMemGetInfo(&freeMem, &totalMem);
    if (first) 
        std::cout << "Allocated | Total Memory | Free Memory "<< std::endl;
    std::cout << allocd << ", " << totalMem << ", " << freeMem << std::endl;
}

int main() 
{
    cudaSetDevice(0);

    report_mem(0, true);
    std::vector<size_t> asizes;
    const int nallocs = 10;
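    // five groups of ten identical requests: 64 kB, 256 kB, 1 MB, 4 MB and 16 MB of floats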
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<14);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<16);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<18);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<20);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<22);

    typedef thrust::device_vector<float> dvecf_t;
    std::vector<dvecf_t*> allocs;
    auto it = asizes.begin();
    for(; it != asizes.end(); ++it) {
        dvecf_t* v = new dvecf_t(*it);
        allocs.push_back(v);
        report_mem(v->capacity() * sizeof(float));
    }
    return 0;
}

and run it on a compute 2.1 device on Windows 64 bit, I get this:

Allocated | Total Memory | Free Memory 
0, 1073741824, 1007849472
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1003655168
262144, 1073741824, 1003655168
1048576, 1073741824, 1002606592
1048576, 1073741824, 1001558016
1048576, 1073741824, 1000509440
1048576, 1073741824, 999460864
1048576, 1073741824, 998412288
1048576, 1073741824, 997363712
1048576, 1073741824, 996315136
1048576, 1073741824, 995266560
1048576, 1073741824, 994217984
1048576, 1073741824, 993169408
4194304, 1073741824, 988975104
4194304, 1073741824, 984780800
4194304, 1073741824, 980586496
4194304, 1073741824, 976392192
4194304, 1073741824, 972197888
4194304, 1073741824, 968003584
4194304, 1073741824, 963809280
4194304, 1073741824, 959614976
4194304, 1073741824, 955420672
4194304, 1073741824, 951226368
16777216, 1073741824, 934449152
16777216, 1073741824, 917671936
16777216, 1073741824, 900894720
16777216, 1073741824, 884117504
16777216, 1073741824, 867340288
16777216, 1073741824, 850563072
16777216, 1073741824, 833785856
16777216, 1073741824, 817008640
16777216, 1073741824, 800231424

which I interpret as indicating that the allocation granularity is 1 MiB (1048576, or 2^20 bytes) on the platform I tested this on: small requests are sub-allocated from within a 1 MiB page (the ten 64 kB requests together consume only a single page), while larger requests consume whole pages. Your platform might be different.
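
If you want to turn that into a sizing estimate, something along these lines might do. Note that the 1 MiB granularity is just what I measured here, and the per-simulation footprint is a placeholder you would have to fill in from your own code:

#include <cuda_runtime.h>
#include <cstddef>
#include <iostream>

int main()
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);

    const size_t granularity = 1 << 20;  // 1 MiB, as measured above; may differ on your platform
    const size_t bytesPerSim = 64;       // placeholder: device memory one simulation needs

    // Use ~90% of the free memory, rounded down to whole allocation pages.
    const size_t usable = (freeMem - freeMem / 10) / granularity * granularity;
    const size_t simsPerBatch = usable / bytesPerSim;

    std::cout << "free: " << freeMem << " bytes, usable: " << usable
              << " bytes, simulations per batch: " << simsPerBatch << std::endl;
    return 0;
}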
