
I was trying the first example from the official website https://developer.nvidia.com/thrust and changed the vector size to 32 << 23. The code is:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main(void) {
  // generate random numbers serially
  thrust::host_vector<int> h_vec(32 << 23);
  std::generate(h_vec.begin(), h_vec.end(), rand);
  std::cout << "1." << time(NULL) << std::endl;

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;
  std::cout << "2." << time(NULL) << std::endl;

  // sort data on the device (846M keys per second on GeForce GTX 480)
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
  std::cout << "3." << time(NULL) << std::endl;

  return 0;
}

But the program crashed at the `thrust::sort` line. When I used `std::vector` and `std::sort` instead, it worked fine.

Is this a bug in Thrust? I am using Thrust 1.7 + CUDA 6.5 + Visual Studio 2013 Update 2.

I am using a GeForce GT 740M with a total of 2048 MB of GPU memory.

I used Process Explorer to monitor the process and saw it allocate 1.0 GB of memory. But I have 2 GB of GPU memory and 16 GB of main CPU memory.

The error message is "A problem caused the program to stop working correctly. Windows will close the program and notify you if a solution is available. [Debug] [Close Program]". After clicking [Debug], I can see the call stack. The issue is at this line:

thrust::device_vector<int> d_vec = h_vec;

The last CUDA frame in the call stack is this:

testcuda.exe!thrust::system::cuda::detail::malloc<thrust::system::cuda::detail::tag>(thrust::system::cuda::detail::execution_policy<thrust::system::cuda::detail::tag> & __formal, unsigned __int64 n) Line 48  C++

It seems to be a memory allocation issue. But I have 2 GB of GPU memory and 16 GB of main CPU memory. Why?

To Robert:

The original example works well, even for 32<<21 and 32<<22. Is there a virtual memory management system for GPU memory? Does CONTIGUOUS here mean physically contiguous or virtually contiguous? Is there any exception raised in this scenario that I can catch?

My test code is here: https://github.com/henrywoo/wufuheng/blob/master/testcuda.cu

In my test, no exception is thrown; there is just a runtime error.

Wu Fuheng
  • `h_vec(32 << 23)` will try to allocate a 270-million-element array. Is there an OOM error thrown? – mrVoid Aug 18 '14 at 09:29
  • Maybe your hardware can't handle a 1 GB vector. – molbdnilo Aug 18 '14 at 09:30
  • To write a better question, instead of saying "The program crashed", paste the actual error output into your question (you can edit your own question.) Also indicate which GPU you are running this on. Did the code work correctly with the original vector size of `32<<20` ? If so, it's likely you are out of GPU memory. – Robert Crovella Aug 18 '14 at 10:13

1 Answer

sizeof(int) * (32 << 23) = 4 * 2^28 = 2^30 bytes
I.e. you are asking for about 1 GiB of GPU RAM. Most likely, your card cannot handle that many elements. This might be because:
  • there isn't enough GPU RAM in general
  • there isn't enough contiguous free GPU RAM (this is needed because the vector has to fit in one contiguous piece of memory)
anderas
  • I have 2 GB of GPU memory and 16 GB of main CPU memory. How do I check whether there is enough contiguous free GPU RAM? – Wu Fuheng Aug 18 '14 at 13:06
  • I'm afraid I don't know of any. However, from my experience, with 2 GB of RAM, finding 1 GB of free RAM is hardly ever possible. Also, this is the most likely reason, given that the failure is in `malloc`. – anderas Aug 18 '14 at 14:00
  • Sort on the device [requires O(N) temporary storage](http://stackoverflow.com/questions/6605498/thrust-sort-by-key-slow-due-to-memory-allocation). Asking to sort a 1 GB vector requires an additional 1 GB of temporary storage, i.e. about 2 GB in total. Your 2 GB GPU doesn't have that much available, due to display overhead and other reasons. You can query the free memory with a [cuda API call](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gd5d6772f4b2f3355078ecd6059e6aa74), but it may not all be available in a single allocation due to fragmentation. – Robert Crovella Aug 18 '14 at 15:23
  • As I mentioned in my post, the crash happens when copying memory from main RAM to GPU RAM, before the sort function is invoked. – Wu Fuheng Aug 19 '14 at 00:05
  • It's quite possible you don't have enough free memory. You never answered my question as to whether a sort of the original size shown in the example code (32<<20) would work correctly or not. And you can query free memory as I indicated already. – Robert Crovella Aug 19 '14 at 00:16
  • The original example works well, even for 32<<21 and 32<<22. Is there a virtual memory management system for GPU memory? Does CONTIGUOUS here mean physically contiguous or virtually contiguous? Is there any exception raised in this scenario that I can catch? – Wu Fuheng Aug 19 '14 at 04:30
  • There is not a virtual memory management system for GPU memory (ignoring WDDM, which is not at issue here). An allocation request for a particular size will fail if a contiguous block of memory is not available in that size or larger. Yes, thrust will [throw an exception](https://github.com/thrust/thrust/wiki/Debugging) when an allocation fails. – Robert Crovella Aug 21 '14 at 00:10
  • Thanks. But I tested it, and catching `thrust::system_error` doesn't work. No exception is thrown; it is just a runtime error and the program simply terminates. Have you ever tested this? My code is available here: https://github.com/henrywoo/wufuheng/blob/master/testcuda.cu – Wu Fuheng Aug 22 '14 at 01:31
  • `thrust::system_error` is not the issue. Please re-read the link I provided. The error is an allocation error, and you should be trapping `std::bad_alloc`. I tested your code for size `32<<24` on a 3GB GPU, and it threw `std::bad_alloc`. I don't have a 2GB GPU to test with. (Your code as-is does not have any allocation issues on my 3GB GPU.) I modified your code to catch bad_alloc, and it seems to work correctly for me. [Here](http://pastebin.com/UkE97TFC) is an example. With size of `32<<23`, on my 3GB GPU, your code just prints out: `1.1408741030 2.1408741037 4.1408741037` – Robert Crovella Aug 23 '14 at 01:37
  • Hi Robert, you are correct - the exception is std::bad_alloc. – Wu Fuheng Aug 23 '14 at 12:13
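Putting the thread's conclusion together, the pattern Robert describes might be sketched as follows (a minimal sketch, not his pastebin code verbatim; it assumes the CUDA runtime and Thrust are installed, and the actual numbers depend on the GPU):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <new>      // std::bad_alloc
#include <cstdio>

int main() {
    // Query free/total device memory before allocating. Note that due to
    // fragmentation, a single allocation of the full "free" amount may
    // still fail: the block must be contiguous.
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::printf("GPU memory: %zu bytes free of %zu total\n",
                free_bytes, total_bytes);

    try {
        thrust::host_vector<int> h_vec(32 << 23);   // 1 GiB of ints
        // The host-to-device copy allocates the device vector; if no
        // contiguous block of that size is available, Thrust surfaces
        // the failure as std::bad_alloc.
        thrust::device_vector<int> d_vec = h_vec;
        // The sort itself needs O(N) additional temporary storage,
        // roughly doubling the footprint.
        thrust::sort(d_vec.begin(), d_vec.end());
    } catch (const std::bad_alloc& e) {
        std::fprintf(stderr, "allocation failed: %s\n", e.what());
        return 1;
    }
    return 0;
}
```

Compiled with nvcc, this exits cleanly with a diagnostic instead of the Windows crash dialog when the allocation fails.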