
I develop software that typically uses both OpenGL and the NVIDIA CUDA SDK. Recently I also started looking for ways to reduce the run-time memory footprint, and I noticed the following (Debug and Release builds differ by only 4-7 MB):

  • Application startup - less than 1 MB total

  • OpenGL 4.5 context creation (+ GLEW loader init) - 45 MB total

  • CUDA 8.0 context (Driver API) creation - 114 MB total

If I create the OpenGL context in "headless" mode, it uses about 3 MB less, which probably corresponds to the default framebuffer allocation. That makes sense, as the window size is 640x360.
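(As a rough sanity check, assuming a double-buffered RGBA8 default framebuffer plus a packed 24/8 depth-stencil buffer: 640 × 360 × 4 bytes ≈ 0.9 MB per surface, so three such surfaces come to roughly 2.6-2.8 MB, which is in the right ballpark for the 3 MB difference.)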

So once the OpenGL and CUDA contexts are up, the process already consumes 114 MB.

Now, I don't have deep knowledge of the OS-specific work that happens under the hood during GL and CUDA context creation, but 45 MB for GL and 68 MB for CUDA seems like a lot to me. I know that several megabytes usually go to system framebuffers and function pointers, and that the bulk of the allocations probably happens on the driver side. But exceeding 100 MB with just "empty" contexts looks excessive.

I would like to know:

  1. Why does GL/CUDA context creation consume such a considerable amount of memory?

  2. Are there ways to optimize that?

The system under test: Windows 10 64-bit, NVIDIA GTX 960 GPU (driver version 388.31), 8 GB RAM, Visual Studio 2015, 64-bit C++ console project.

I measure memory consumption using the Visual Studio built-in Diagnostic Tools -> Process Memory section.

UPDATE

I tried Process Explorer, as suggested by datenwolf. Here is a screenshot of what I got (my process is at the bottom, highlighted in yellow):

[screenshot: Process Explorer memory columns for the process]

I would appreciate some explanation of that information. I had always been looking at "Private Bytes" in the VS Diagnostic Tools window, but here I also see "Working Set", "WS Private", etc. Which one correctly shows how much memory my process currently uses? 281,320 K looks like way too much, because, as I said above, the process does nothing at startup except create the CUDA and OpenGL contexts.
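For reference, the same counters can also be read programmatically through the Win32 psapi API; here is a minimal sketch (the mapping of PrivateUsage to "Private Bytes" and WorkingSetSize to the total "Working Set" is my understanding of the documentation):

#include <windows.h>
#include <psapi.h>   /* link with psapi.lib */
#include <stdio.h>

/* Prints the counters that correspond roughly to the "Private Bytes"
   and "Working Set" columns shown by the diagnostic tools. */
static void report_process_memory(const char* label)
{
    PROCESS_MEMORY_COUNTERS_EX pmc;
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo(GetCurrentProcess(),
                             (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc)))
    {
        printf("%s: private = %zu KB, working set = %zu KB\n",
               label,
               (size_t)pmc.PrivateUsage / 1024,
               (size_t)pmc.WorkingSetSize / 1024);
    }
}

int main(void)
{
    report_process_memory("at startup");
    /* ... create the OpenGL / CUDA contexts here and call it again ... */
    return 0;
}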

  • Which CUDA API are you using and when are you measuring the memory footprint? – talonmies Nov 23 '17 at 21:06
  • Driver API. As I said, I measure using the Diagnostic tools, after process launch. – Michael IV Nov 23 '17 at 21:08
  • But have you loaded anything into the context when you measure? – talonmies Nov 23 '17 at 21:27
  • Nope, that's why I called it "empty" contexts :) In the GL case I loaded only API function pointers (GLEW) – Michael IV Nov 23 '17 at 21:30
  • @MichaelIV ultimately I think you have to ask the NVidia driver devs about this. And those are not of the talkative kind. As far as what's exactly going down, we can only speculate. At least for CUDA I know the driver library (cuda.dll / libcuda.so) carries with it a couple of MiB of GPU code for standard tasks. This code is packed and/or in an intermediary form and gets unpacked upon startup; that unpacked part then sits around in memory. I suspect something similar happens with OpenGL. What puzzles me, though, is that this is not relegated to a driver helper and reused by means of RO shared memory. – datenwolf Nov 23 '17 at 23:15
  • @MichaelIV I don't have a Windows system with an NVidia GPU at hand right now, otherwise I'd check myself. Could you please take a look at whether those large-ish parts of the resource allocation are actually process-exclusive? It would strike me as odd if the NVidia devs didn't think of putting that part into shared memory. – datenwolf Nov 23 '17 at 23:17
  • @MichaelIV just FYI: At least with Linux a lot of that stuff seems to be kept in a shared memory segment. Which makes lots of sense. – datenwolf Nov 23 '17 at 23:40
  • CUDA context establishment will pre-allocate things like the printf buffer and the runtime heap which is used for kernel malloc. There are APIs to control the size of those things (see the sketch after these comments). – talonmies Nov 24 '17 at 06:25
  • @datenwolf I guess it is a waste of time asking a Linux dude like you: "how to take a look if those large-ish parts of resource allocation are actually process exclusive?" on Windows ;) – Michael IV Nov 24 '17 at 13:39
  • @MichaelIV 1. Install [Process Explorer from Microsoft Technet](https://learn.microsoft.com/en-us/sysinternals/downloads/process-explorer) 2. after launching it, in the process table right-click the table header → select columns 3. in the tab "Process Memory" select "WS Sharable Bytes" and "WS Shared Bytes" (WS = working set) 4. Apply. Also you can open a property page for each process and in the "Performance" tab see how much of the reserved working set memory is shared. – datenwolf Nov 24 '17 at 15:14
  • @datenwolf I added more info, after checking with Process Explorer. Would be great to get your comment on this one. Thanks. – Michael IV Feb 18 '18 at 19:40
  • Have you considered filing a bug report with NVIDIA about this issue? – einpoklum Feb 01 '21 at 22:37
  • @einpoklum Well, this is an old question. Since then I moved to Vulkan which comes with its own driver-side weirdness (like device setup latency on AWS server). I never filed a bug regarding this one. – Michael IV Feb 02 '21 at 09:27
  • Part of this memory could be memory-mapped GPU memory. Did GLEW register any texture? – Bruno Coutinho Feb 06 '21 at 09:52
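Following up on the comment above about the pre-allocated printf buffer and runtime heap: a minimal sketch of shrinking those pools through the Driver API limit calls. The sizes chosen here are arbitrary placeholders, and since these limits govern device-side allocations it is an open question how much, if at all, they reduce the host-side footprint:

#include <cuda.h>

/* Call right after cuCtxCreate() has made the new context current. */
static void shrink_context_limits(void)
{
    /* Buffer backing in-kernel printf() (default is on the order of 1 MB). */
    cuCtxSetLimit(CU_LIMIT_PRINTF_FIFO_SIZE, 64 * 1024);

    /* Heap backing in-kernel malloc()/free() (default is on the order of 8 MB). */
    cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, 256 * 1024);
}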

1 Answer


Partial answer: this is an OS-specific issue; on Linux, creating a CUDA context takes about 9.3 MB.


I'm using CUDA (not OpenGL) on GNU/Linux:

  • CUDA version: 10.2.89
  • OS distribution: Devuan GNU/Linux Beowulf (~= Debian Buster without systemd)
  • Kernel: Linux 5.2.0
  • Processor: Intel x86_64

To check how much memory gets used by CUDA when creating a context, I ran the following C program (which also checks what happens after context destruction):

#include <stdio.h>
#include <cuda.h>
#include <malloc.h>
#include <stdlib.h>

static void print_allocation_stats(const char* s)
{
    printf("%s:\n", s);
    printf("--------------------------------------------------\n");
    malloc_stats();
    printf("--------------------------------------------------\n\n");
}

int main()
{
    print_allocation_stats("Initially");

    CUresult status = cuInit(0);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After CUDA driver initialization");

    int device_id = 0;
    unsigned flags = 0;
    CUcontext context_id;
    status = cuCtxCreate(&context_id, flags, device_id);
    if (status != CUDA_SUCCESS ) { return EXIT_FAILURE; }
    print_allocation_stats("After context creation");

    status = cuCtxDestroy(context_id);
    if (status != CUDA_SUCCESS ) { return EXIT_FAILURE; }
    print_allocation_stats("After context destruction");
    return EXIT_SUCCESS;
}

(Note that malloc_stats() is a glibc-specific function, not part of the standard library.)
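A build command along these lines should work for the driver-API program above (the source file name here is hypothetical, and the include path is an assumption that depends on where the CUDA toolkit is installed):

gcc cuda_context_memory.c -I/usr/local/cuda/include -lcuda -o cuda_context_memory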

Summarizing the results and snipping irrelevant parts:

| Point in program | Total bytes | In-use bytes | Max MMAP regions | Max MMAP bytes |
|---|---|---|---|---|
| Initially | 135168 | 1632 | 0 | 0 |
| After CUDA driver initialization | 552960 | 439120 | 2 | 307200 |
| After context creation | 9314304 | 6858208 | 8 | 6643712 |
| After context destruction | 7016448 | 580688 | 8 | 6643712 |

So CUDA starts out at about 0.5 MB, and after creating a context takes up 9.3 MB (dropping back down to 7.0 MB once the context is destroyed). 9 MB is still a lot of memory for not having done anything; but maybe some of it is all-zeros, uninitialized, or copy-on-write, in which case it doesn't really occupy that much physical memory.

It's possible that memory use improved dramatically over the two years between the driver releases for CUDA 8 and CUDA 10, but I doubt it. So it looks like your problem is Windows-specific.

Also, I should mention that I did not create an OpenGL context, which is the other part of the OP's question, so I haven't estimated how much memory that takes. The OP raises the question of whether the sum is greater than its parts, i.e. whether a CUDA context would take more memory if an OpenGL context existed as well; I believe this should not be the case, but readers are welcome to try and report...

einpoklum
  • Your app is text-only; it would use more memory if you opened a window and created an OpenGL context. I suspect in your case CUDA is bypassing the entire graphics stack, and on Windows it can't do that. – Bruno Coutinho Feb 06 '21 at 09:43
  • @BrunoCoutinho: First, it's true that I didn't address the question of OpenGL contexts and I'll clarify this. But - it is unlikely a CUDA context would take more memory if an OpenGL context existed. CUDA is not (directly) about graphics, and a CUDA context is likely oblivious - as you create it - to whether or not you're doing something graphical. Of course it could be the case that Windows is forcing CUDA to go through the "graphics stack" - I wouldn't know since I'm not a Windows developer. – einpoklum Feb 06 '21 at 10:06
  • Hi @einpoklum. I am specifically talking about graphics applications. For instance, a CUDA context is used to provide interop between OpenGL and the NVIDIA Video Encoder API. Also, maybe the drivers have since been optimized. This question is almost 4 years old. – Michael IV Feb 06 '21 at 20:46
  • @MichaelIV: I realize it's an old question by now, but - I only noticed it a couple of days ago :-) I also mentioned the possibility of driver improvements between CUDA 8 and 10. – einpoklum Feb 06 '21 at 21:44