
I was hoping someone could help me make some progress with some texture benchmarks I'm running in OpenGL ES 2.0 on an iPhone 4.

I have an array that contains sprite objects. The render loop cycles through all the sprites per texture and retrieves all their texture coords and vertex coords. It adds those to a giant interleaved array, using degenerate vertices and indices, and sends those to the GPU (I'm embedding code at the bottom). This is all done per texture, so I'm binding the texture once, then building my interleaved array, then drawing it. Everything works just great and the results on the screen are exactly what they should be.

So my benchmark test is done by adding 25 new sprites per touch at varying opacities and changing their vertices on each update so that they bounce around the screen while rotating, all while running the OpenGL ES Analyzer on the app.

Here's where I'm hoping for some help... I can get to around 275 32x32 sprites with varying opacity bouncing around the screen at 60 fps. By 400 I'm down to 40 fps. When I run the OpenGL ES Performance Detective it tells me:

The app rendering is limited by triangle rasterization - the process of converting triangles into pixels. The total area in pixels of all of the triangles you are rendering is too large. To draw at a faster frame rate, simplify your scene by reducing either the number of triangles, their size, or both.

The thing is, I just whipped up a test in Cocos2D using CCSpriteBatchNode with the same texture and created 800 transparent sprites, and the framerate is an easy 60 fps.

Here is some code that may be pertinent...

Shader.vsh (matrices are set up once at the beginning)

attribute vec4 position;
attribute vec2 texCoordIn;
attribute vec4 colorIn;
uniform mat4 projectionMatrix, modelViewMatrix;
varying vec2 texCoordOut;
varying vec4 colorOut;
void main()
{
    gl_Position = projectionMatrix * modelViewMatrix * position;
    texCoordOut = texCoordIn;
    colorOut = colorIn;
}

Shader.fsh (colorOut is used to calculate opacity)

varying mediump vec2 texCoordOut;
varying lowp vec4 colorOut;
uniform sampler2D texture;
void main()
{
    lowp vec4 fColor = texture2D(texture, texCoordOut);
    gl_FragColor = vec4(fColor.xyz, fColor.w * colorOut.a);
}

VBO setup

    glGenBuffers(1, &_vertexBuf);
    glGenBuffers(1, &_indiciesBuf);
    glGenVertexArraysOES(1, &_vertexArray);

    glBindVertexArrayOES(_vertexArray);

    glBindBuffer(GL_ARRAY_BUFFER, _vertexBuf);
    glBufferData(GL_ARRAY_BUFFER, sizeof(TDSEVertex)*12000, &vertices[0].x, GL_DYNAMIC_DRAW);
    glEnableVertexAttribArray(GLKVertexAttribPosition);
    glVertexAttribPointer(GLKVertexAttribPosition, 2, GL_FLOAT, GL_FALSE, sizeof(TDSEVertex), BUFFER_OFFSET(0));

    glEnableVertexAttribArray(GLKVertexAttribTexCoord0);
    glVertexAttribPointer(GLKVertexAttribTexCoord0, 2, GL_FLOAT, GL_FALSE, sizeof(TDSEVertex), BUFFER_OFFSET(8));

    glEnableVertexAttribArray(GLKVertexAttribColor);
    glVertexAttribPointer(GLKVertexAttribColor, 4, GL_FLOAT, GL_FALSE, sizeof(TDSEVertex), BUFFER_OFFSET(16));

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, _indiciesBuf);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(ushort)*12000, indicies, GL_STATIC_DRAW);

    glBindVertexArrayOES(0);

Update Code

    /*

        Here it cycles through all the sprites, gets their vert info (includes coords, texture coords, and color) and adds them to this giant array
        The array is of...
        typedef struct{
             float x, y;
             float tx, ty;
             float r, g, b, a;
        }TDSEVertex;
    */

    glBindBuffer(GL_ARRAY_BUFFER, _vertexBuf);
    //glBufferSubData(GL_ARRAY_BUFFER, sizeof(vertices[0])*(start), sizeof(TDSEVertex)*(indicesCount), &vertices[start]);
    glBufferData(GL_ARRAY_BUFFER, sizeof(TDSEVertex)*indicesCount, &vertices[start].x, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

Render Code

    GLKTextureInfo* textureInfo = [[TDSETextureManager sharedTextureManager].textures objectForKey:textureName];
    glBindTexture(GL_TEXTURE_2D, textureInfo.name);

    glBindVertexArrayOES(_vertexArray);
    glDrawElements(GL_TRIANGLE_STRIP, indicesCount, GL_UNSIGNED_SHORT, BUFFER_OFFSET(start));
    glBindVertexArrayOES(0);

Here's a screenshot at 400 sprites (800 triangles + 800 degenerate triangles) to give an idea of the opacity layering as the textures are moving... Again, I should note that a VBO is being created and sent per texture, so I'm binding and then drawing only twice per frame (since there are only two textures).

screenshot showing layering of sprites

Sorry if this is overwhelming, but it's my first post on here and I wanted to be thorough. Any help would be much appreciated.

PS, I know that I could just use Cocos2D instead of writing everything from scratch, but where's the fun (and learning) in that?!

UPDATE #1: When I switch my fragment shader to only be

    gl_FragColor = texture2D(texture, texCoordOut);

it gets to 802 sprites at 50 fps (4804 triangles including degenerate triangles), though setting sprite opacity is lost... Any suggestions as to how I can still handle opacity in my shader without running at 1/4 the speed?

UPDATE #2: So I ditched GLKit's view and view controller and wrote a custom view loaded from the AppDelegate. 902 sprites with opacity & transparency at 60 fps.

yiannis
  • Rather than using the OpenGL ES Performance Detective, I prefer to use Instruments with the OpenGL ES Analyzer and the OpenGL ES Driver instruments. The Analyzer can usually point out more subtle rendering issues, and the driver can give you the percentage load on the Tiler (vertex side) and Renderer (fragment side) to verify where your bottleneck is. Also try running Time Profiler against this. This seems slower than it should be, because I benchmarked the A4 as running at ~1.8 million triangles per second using VBOs and simple shading, or ~30,000 onscreen at 60 FPS. – Brad Larson Mar 20 '12 at 20:07
  • Thanks for the reply @BradLarson. I use both the Analyzer and the Driver non-stop to the best of my ability, though there are things I don't understand as much... Renderer Utilization: 99%, Tiler Utilization: 8%, Device Utilization: 99%. Thoughts? The only thing that stands out in the Analyzer is the redundant calls, but the vast majority are from glView (binding the frame buffer, etc...). Time Profiler shows 83% comes from when I bind and update the VBO (done in update) and draw. Do you think that using glView is part of the problem? I assumed it was the layers of transparency (I'm attaching a screenshot and my updated code). Thx. – yiannis Mar 20 '12 at 21:23
  • If your Renderer Utilization is indeed at 99%, and your Tiler at 8%, you're clearly fill-rate limited, not geometry limited. Optimizing triangles and your VBOs won't do you much good, you need to focus on drawing fewer pixels or making the pixel drawing faster. You've got a lot of blending going on in the scene above, [from experience](http://stackoverflow.com/q/6051237/19679) I know this to be horribly expensive if a bunch of blended objects pile up on top of one another. You could render opaque regions and write to the depth buffer, then read from it when blending translucent ones. – Brad Larson Mar 21 '12 at 14:43
  • That makes a lot of sense. From my understanding, I'd use two different fragment shaders: one to go through the entire scene and draw it out, and the next to handle the transparency against the previous results. I assume that rather than drawing it out, you mean that I should write it to the depth buffer. Any links to resources that I can look at to get a better understanding of how to write and correctly access that data? Thanks again... – yiannis Mar 22 '12 at 20:51
  • My answer here: http://stackoverflow.com/a/6170939/19679 summarizes the depth-writing approach that Tommy and others suggested, which is used in this application (source is available at the link): http://www.sunsetlakesoftware.com/molecules . Basically, you use a prepass with `glDepthMask(GL_TRUE);` for all of the definitely opaque areas, then use `glDepthMask(GL_FALSE);` and render the remainder of your blended scene. This led to an over sixfold improvement in my case. – Brad Larson Mar 22 '12 at 21:41
  • Can't wait to get some time to try this out. I should add that I quickly whipped up a custom view with a CAEAGLLayer rather than using the GLKit view and GLKViewController to see if there was a difference... 902 sprites, all but 2 of them 50% transparent, using the original shader, at 60 fps. Over a 200% performance boost. Lesson learned about GLKit under heavy strain. I can only imagine how implementing your suggestions will affect the performance (although realistically, 900 transparent sprites ONSCREEN exceeds my requirements :) ). Thank you again for your input and time. I'll post back with results. – yiannis Mar 23 '12 at 22:38

1 Answer


Mostly miscellaneous thoughts...

If you're triangle limited, try switching from GL_TRIANGLE_STRIP to GL_TRIANGLES. You're still going to need to specify exactly the same number of indices — six per quad — but the GPU never has to spot that the connecting triangles between quads are degenerate (i.e. it never has to convert them into zero pixels). You'll need to profile to see whether you end up paying a cost for no longer implicitly sharing edges.

You should also shrink the footprint of your vertices. I would dare imagine you can specify x, y, tx and ty as 16-bit integers, and your colours as 8-bit integers without any noticeable change in rendering. That would reduce the footprint of each vertex from 32 bytes (eight components, each four bytes in size) to 12 bytes (four two-byte values plus four one-byte values, with no padding needed because everything is already aligned) — cutting almost 63% of the memory bandwidth costs there.

As you actually seem to be fill-rate limited, you should consider your source texture too. Anything you can do to trim its byte size will directly help texel fetches and hence fill rate.

It looks like you're using art that is consciously about the pixels, so switching to PVR probably isn't an option. That said, people sometimes don't realise the full benefit of PVR textures; if you switch to, say, the 4 bits per pixel mode, then you can scale your image up to be twice as wide and twice as tall so as to reduce compression artefacts, and still only be paying 16 bits per source pixel, but likely getting a better luminance range than a 16 bpp RGB texture.

Assuming you're currently using a 32 bpp texture, you should at least see whether an ordinary 16 bpp RGB texture is sufficient using any of the provided hardware modes (especially if the 1 bit of alpha plus 5 bits per colour channel is appropriate to your art, since that loses only 9 bits of colour information versus the original while reducing bandwidth costs by 50%).

It also looks like you're uploading indices every single frame. Upload only when you add extra objects to the scene or if the buffer as last uploaded is hugely larger than it needs to be. You can just limit the count passed to glDrawElements to cut back on objects without a reupload. You should also check whether you actually gain anything by uploading your vertices to a VBO and then reusing them if they're just changing every frame. It might be faster to provide them directly from client memory.

Tommy
  • Thanks @Tommy. Because this is being done on the iPhone, I don't think I can provide the vertices directly from my code (I assume using glBegin). I did try the GL_TRIANGLES method, and though it cut down many triangles, it didn't do much else. I would like to use PVR, but I've been having trouble with the texture tool. I did reduce the texture size from 37k to 9k, but still no difference. I'm gonna try to convert those images to 16 bpp, but it's just been such a pain! Thanks again for your effort though. – yiannis Mar 21 '12 at 05:50
  • Using a VBO is an optional alternative to just supplying the arrays directly. The reason you have to supply offsets as though they were a pointer to `glVertexAttribPointer` is that you're permitted just to give it a pointer to some client-side memory. My suggestion is that you experiment with cutting out the `_vertexBuf` VBO and the corresponding `glBufferData` call. You should be able to convert to 16 bpp at runtime; Apple used to supply a `Texture2D` class in a bunch of example projects that would do that but I guess it's vanished now that GLKit is on the scene... – Tommy Mar 21 '12 at 18:01