
I'm making adjustments to images with OpenGL, so I need this kind of workflow:

  1. Upload image data to graphics card
  2. Transform image
  3. Download result back to main memory

Each step stalls until the previous one has finished; this is fine. I need to complete all of these steps as quickly as possible, and multiplexing with other operations would not be an improvement for me: I just need this one image finished as fast as possible.

Now, step 2 is really quick, and step 3 is not so bad, most likely because the result is a thumbnail of the original image, which is vastly smaller.

Step 1 is my bottleneck: I measure 1.2 seconds to upload 20 MB of image data, which puts me at roughly 16 MB/s. Elsewhere on the Internet I read about people expecting 5.5 GB/s and being disappointed by 2.5 GB/s.

It doesn't matter whether I use glTexImage2D directly or go via a PBO; I have tried both and measured no difference. This makes sense, since I'm not multiplexing with anything, and for my pipeline I cannot use a PBO without stalling immediately anyway.

The remaining explanation I can think of is that my system is just this slow. My graphics card is an NVIDIA GeForce GTX 285 (GT200), attached via 16x PCI Express. Is my measured 16 MB/s as fast as this setup will go, or have I overlooked something? Is there a utility (for Ubuntu, or Linux in general) that lets me measure the maximum achievable transfer rate?

I don't feel comfortable concluding that the system is this slow; after all, my network interface is enormously faster (1 Gb/s ≈ 125 MB/s) and achieves that over nothing more than a Cat 5e cable.


Further details: The glTexImage2D case is pretty straightforward:

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, image.width, image.height, 0, GL_RGBA, GL_UNSIGNED_BYTE, rawData);

Timing this line alone measures ~1200ms.

I have also translated it to use PBO, as mentioned:

GLuint pbo = 0;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
// Copy the pixel data into the PBO:
glBufferData(GL_PIXEL_UNPACK_BUFFER, data_size, pixels, GL_STREAM_DRAW);
// With a PBO bound, the last argument is an offset into it, not a pointer:
glTexImage2D(target, level, internalformat, width, height, border, format, type, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
glDeleteBuffers(1, &pbo);

I also tried memory-mapping the buffer:

// Allocate the PBO without initial data, then map it and copy into it:
glBufferData(GL_PIXEL_UNPACK_BUFFER, data_size, 0, GL_STREAM_DRAW);
GLubyte* ptr = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
std::copy(pixels, pixels + data_size, ptr);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

No appreciable difference in timing between any of the solutions.


What kind of data rates should I expect when uploading texture data?

Is 16 MB/s reasonable with my setup? (My feeling is "no"; please tell me if it is!)

Is there a tool I can use to verify that this really is the speed of my system, thereby vindicating my code, or alternatively definitively placing the blame on my code?

Magnus Hoff
  • Have you verified that there is no risk for automatic format conversion when doing the upload? That conversion could conceivably be done in software by your driver, and thus be pretty slow. – unwind Apr 16 '13 at 09:47
  • @unwind I have not verified this. How would I go about doing that? It's still weird, though. I'm downloading the image over the network, decompressing the jpeg and switching all the R's with B's in 160ms. A format conversion should be extremely involved to take up to one second..? – Magnus Hoff Apr 16 '13 at 09:53
  • If this is done multiple times using the same texture object and the texture size or internal format doesn't change, then of course using `glTexSubImage2D` instead of `glTexImage2D` would be a start, since this won't completely reallocate the whole texture each time, which might come at a considerable maintenance effort on the driver's/hardware's side. – Christian Rau Apr 16 '13 at 11:25
  • @MagnusHoff You might just try different formats for the uploaded data (not the internal format of the texture). Some platforms are happier with `GL_BGRA` format than `GL_RGBA`. And play with the layout, using rows padded to 4/8 bytes (and an appropriate `glPixelStore` setup) might also be preferred. – Christian Rau Apr 16 '13 at 11:27
  • @ChristianRau It's a one-off, so `glTexSubImage2D` won't help me either :( Will proceed with testing other texture formats. – Magnus Hoff Apr 16 '13 at 11:36
  • @MagnusHoff If you're really going to change the texture format (in contrast to the data format), try to keep both in sync, of course (a `GL_R32F` format might not play that well with `GL_RGBA`-`GL_UNSIGNED_SHORT`-data, while `GL_RGBA8` together with `GL_BGRA`-`GL_UNSIGNED_BYTE`-data should do quite well). – Christian Rau Apr 16 '13 at 11:52
  • You have not provided enough information to be able to diagnose a problem. For example, you haven't posted your *uploading code*. – Nicol Bolas Apr 16 '13 at 11:55
  • @ChristianRau Was thinking about the upload data format, as you suggested. However, I am currently using `GL_RGBA` with `GL_RGBA`-`GL_UNSIGNED_BYTE`. Should do well. Substituting `GL_BGRA` for either or both of them does not affect performance. – Magnus Hoff Apr 16 '13 at 12:07
  • @NicolBolas I am of course happy to offer further details and I'll add some to the question. However, the most important aspect of this question is this: What data rate should I expect? Am I crazy to expect more than I see? – Magnus Hoff Apr 16 '13 at 12:17
  • @MagnusHoff I hope you don't use `GL_RGBA` as texture internal format (oh wait, updated answer, you did), since that is bad practice, you should always use sized and typed internal formats (in your case most probably `GL_RGBA8`), even if it most probably won't make a difference in your particular case. And using `GL_BGRA` as internal format shouldn't even work, as the internal component ordering is completely up to your hardware, anyway. – Christian Rau Apr 16 '13 at 12:23
  • @ChristianRau `GL_BGRA` as internal format did indeed not work, but it spent the same amount of time not working ;) `GL_RGBA8` makes no difference in timing, as you suspect. – Magnus Hoff Apr 16 '13 at 12:26
  • @MagnusHoff *"but it spent the same amount of time not working"* - Well that is indeed very interesting and might point to the fact that the copying (or the texture allocation) isn't the slowing factor. When a `glTexImage2D` just returns with `GL_INVALID_ENUM`, there isn't any texture allocation or memory transfer going on. – Christian Rau Apr 16 '13 at 12:38
  • @MagnusHoff How are you actually measuring all those times? Might it be that the timing introduces unnecessary stalls or inaccuracies? For timing GPU processes, the *ARB_timer_query* extension (core since GL 3.something) is usually the best approach (but Ok, given that either `glTexImage2D` or `glBufferData` should stall anyway, normal CPU timing might also work). – Christian Rau Apr 16 '13 at 12:43
  • @ChristianRau Oh my. Preliminary testing shows that you might be very correct. I'm working through several layers of abstraction, and the slow link may very well be elsewhere, even though timing the `glTexImage2D`-call at the high level of abstraction takes lots of time. *Digging into deeper layers* Thanks for the discussion so far. I will continue to post updates here as I discover stuff. – Magnus Hoff Apr 16 '13 at 12:56
  • @ChristianRau Yup. The time definitely goes into `new Uint8Array(image.buffer)` in one of my many layers. Thank you very much for helping me discover my mistaken assumptions! – Magnus Hoff Apr 16 '13 at 13:21

1 Answer


No, I was not crazy for expecting a higher data transfer rate.

My mistake was that I timed the data upload at too high an abstraction level, so I inadvertently included `new Uint8Array(image.buffer)` in my timing. In one measurement, that call took 1190 ms while `glTexImage2D` took 10 ms.

Lesson for next time: take the timings on the exact lines immediately before and after the specific call in question. Only then have I actually identified the problem.

Big thanks to @ChristianRau for hand-holding me through the debugging.
