OpenGL Meshing Slower Than Not Meshing

Question

I'm making a 3D voxel game to learn OpenGL (think Minecraft). I know that rendering each face of each cube is slow, so I'm working on meshing. My meshing algorithm of choice is similar to greedy meshing, although it doesn't merge adjacent quads into one larger quad. Here's what some of my important code looks like:

void build_mesh(chunk *c) {
    if (c->meshes != NULL) {
        vector_free(c->meshes); // delete the old mesh list
    }
    c->meshes = vector_create(); // creates a new mesh list

    for (int x = 0; x < CHUNK_SIZE; x++) {
        for (int y = 0; y < CHUNK_HEIGHT; y++) {
            for (int z = 0; z < CHUNK_SIZE; z++) {
                if (c->data[x][y][z] == 1) {
                    mesh m;
                    m.pos.x = x;
                    m.pos.y = y;
                    m.pos.z = z;

                    if (x - 1 < 0 || c->data[x - 1][y][z] == 0) {
                        // if we're in here that means we have to render the quad
                        m.type = X_MIN;
                        vector_add(&c->meshes, m);
                    }

                    if (x + 1 >= CHUNK_SIZE || c->data[x + 1][y][z] == 0) {
                        m.type = X_POS;
                        vector_add(&c->meshes, m);
                    }

                    if (y - 1 < 0 || c->data[x][y - 1][z] == 0) {
                        m.type = Y_MIN;
                        vector_add(&c->meshes, m);
                    }

                    if (y + 1 >= CHUNK_HEIGHT || c->data[x][y + 1][z] == 0) {
                        m.type = Y_POS;
                        vector_add(&c->meshes, m);
                    }

                    if (z - 1 < 0 || c->data[x][y][z - 1] == 0) {
                        m.type = Z_MIN;
                        vector_add(&c->meshes, m);
                    }

                    if (z + 1 >= CHUNK_SIZE || c->data[x][y][z + 1] == 0) {
                        m.type = Z_POS;
                        vector_add(&c->meshes, m);
                    }
                }
            }
        }
    }
}

void render_chunk(chunk *c, vert *verts, unsigned int program, mat4 model, unsigned int modelLoc, bool greedy) {
    // meshing code
    if (greedy) {
        for (int i = 0; i < vector_size(c->meshes); i++) {
            glm_translate_make(model, (vec3){c->meshes[i].pos.x, c->meshes[i].pos.y, c->meshes[i].pos.z});
            setMat4(modelLoc, model);
            glBindVertexArray(verts[c->meshes[i].type].VAO);
            glDrawArrays(GL_TRIANGLES, 0, 6);
        }
        return;
    }

    for (int x = 0; x < CHUNK_SIZE; x++) {
        for (int y = 0; y < CHUNK_HEIGHT; y++) {
            for (int z = 0; z < CHUNK_SIZE; z++) {
                for (int i = 0; i < 6; i++) {
                    if (c->data[x][y][z] == 1) {
                        glm_translate_make(model, (vec3){x, y, z});
                        setMat4(modelLoc, model);

                        glBindVertexArray(verts[i].VAO);
                        glDrawArrays(GL_TRIANGLES, 0, 6);
                    }
                }
            }
        }
    }
}

build_mesh only gets called when the chunk gets updated and render_chunk gets called every frame. If greedy is true, greedy meshing is implemented. However, the problem is that greedy meshing is significantly slower than just rendering everything, which should not be happening. Does anyone have any ideas what's going on?

Edit: After timing the mesh rendering, it takes ~30-40 ms per frame. However, it scales up really well and still takes 30-40 ms regardless of how large the chunk is.

This is a duplicate of [How to best write a voxel engine in C with performance in mind](https://stackoverflow.com/a/48092685/2521214). It looks like it got deleted, even though the close was questionable: the close voters did not understand the question at the time (much like the close vote you got for a "debugging" reason, which is bullshit), and the reopen was not successful either. It's a shame, as it was quite good (question +3 and my answer +7 score). You will only see it once your rep is higher... How many meshes and how many voxels do you have? You probably have too many glDraw calls – Spektre, Mar 04 '21 at 10:19
Sorry, I'm kind of confused what you're saying at the start. Here's the amount of draw calls: `Normal: 786432 = 128 * 32 * 32 * 6` `Meshing: 18432 = surface area of 128 * 32 * 32 rectangular prism` As you can see, there are far fewer draw calls when meshing. – Nikhil Nayak, Mar 04 '21 at 16:02
Hard to say why the downvote. My bet is someone just assumed (without properly reading your question) that this is a debugging question (and started the close vote accordingly, and also downvoted ... but those might be different users too), as your question lacks what is wrong with your code and what you have tried. Of course, as both of your codes work (as you described) and you just struggle with a speed inconsistency, this is not debugging and does not need that stuff, as it's not related to the problem you've got at all... – Spektre, Mar 04 '21 at 19:15
Back to your question: `18432` calls to `glDrawArrays` is way too many, as the call itself is a performance hit due to the way GL works. You should group your meshes into far fewer VAOs/VBOs ... for example 128 or less ... you can divide your voxel space into slices, so if you've got 128x32x32 cubes, try to put 32x32 cubes into a single VAO/VBO and see if it makes any difference in speed ... also I would get rid of the translation of cubes and store the cube vertexes in the VBO already translated – Spektre, Mar 04 '21 at 19:20
Ok. After looking at what you just said, I'm going to make it so there is one VBO per chunk. Every time the chunk gets rebuilt, I will regenerate the VBO. Thanks! Also, if you want to, you can put your comment as an answer so I can flag it as solved. – Nikhil Nayak, Mar 04 '21 at 19:44
One VBO per "chunk" is exactly how I implemented my voxel engine, and remeshing upon chunk updates/generation, you're on the right track – Mudkip Hacker, Mar 04 '21 at 23:26
@NikhilNayak created an answer... but accept it only if your problem is solved by it... it's possible you have more issues than just the number of `glDraws`, and by accepting an answer you lower your chance of getting another/better answer, as many users skip questions with an accepted answer (me too sometimes)... – Spektre, Mar 05 '21 at 06:36

Spektre · Answer 1 · 2021-03-05T06:45:06.517

0

18432 calls to glDrawArrays is way too many, as the call itself is a performance hit due to the way GL works.

You should group your meshes into far fewer VAOs/VBOs ... for example 128 or less ... you can divide your voxel space into slices, so if you've got 128x32x32 cubes, try to put 32x32 cubes into a single VAO/VBO and see if it makes any difference in speed ... also I would get rid of the translation of cubes and store the cube vertexes in the VBO already translated.
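A minimal sketch of the "store the vertexes already translated" idea, assuming the same `data[x][y][z]` layout as in the question. The names `chunk_sketch`, `emit_chunk_vertices` and `build_chunk_vertex_array`, and the values chosen for `CHUNK_SIZE` and `CHUNK_HEIGHT`, are made up for illustration and are not taken from the asker's code. It only builds the CPU-side vertex array and does not touch GL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE   32   /* assumed value */
#define CHUNK_HEIGHT 128  /* assumed value */

/* Hypothetical stand-in for the asker's chunk type. */
typedef struct {
    uint8_t data[CHUNK_SIZE][CHUNK_HEIGHT][CHUNK_SIZE];
} chunk_sketch;

/* Corners of each unit-cube face, counter-clockwise seen from outside.
 * Order: X_MIN, X_POS, Y_MIN, Y_POS, Z_MIN, Z_POS. */
static const float FACE_CORNERS[6][4][3] = {
    {{0,0,0},{0,0,1},{0,1,1},{0,1,0}},
    {{1,0,0},{1,1,0},{1,1,1},{1,0,1}},
    {{0,0,0},{1,0,0},{1,0,1},{0,0,1}},
    {{0,1,0},{0,1,1},{1,1,1},{1,1,0}},
    {{0,0,0},{0,1,0},{1,1,0},{1,0,0}},
    {{0,0,1},{1,0,1},{1,1,1},{0,1,1}},
};
/* Direction of the neighbouring voxel behind each face. */
static const int FACE_DIR[6][3] = {
    {-1,0,0},{1,0,0},{0,-1,0},{0,1,0},{0,0,-1},{0,0,1}
};
/* Two triangles per quad, as indices into the four corners. */
static const int QUAD_TRIS[6] = {0, 1, 2, 0, 2, 3};

/* Voxels outside the chunk count as empty, as in the question's build_mesh. */
static int is_solid(const chunk_sketch *c, int x, int y, int z) {
    if (x < 0 || x >= CHUNK_SIZE || y < 0 || y >= CHUNK_HEIGHT ||
        z < 0 || z >= CHUNK_SIZE)
        return 0;
    return c->data[x][y][z] != 0;
}

/* Writes xyz positions of all exposed faces, already translated to the
 * voxel position. With out == NULL it only counts. Returns vertex count. */
size_t emit_chunk_vertices(const chunk_sketch *c, float *out) {
    size_t nverts = 0;
    for (int x = 0; x < CHUNK_SIZE; x++)
    for (int y = 0; y < CHUNK_HEIGHT; y++)
    for (int z = 0; z < CHUNK_SIZE; z++) {
        if (!c->data[x][y][z]) continue;
        for (int f = 0; f < 6; f++) {
            if (is_solid(c, x + FACE_DIR[f][0], y + FACE_DIR[f][1],
                         z + FACE_DIR[f][2]))
                continue; /* face is hidden by a neighbour */
            for (int k = 0; k < 6; k++, nverts++) {
                if (!out) continue;
                const float *p = FACE_CORNERS[f][QUAD_TRIS[k]];
                out[nverts * 3 + 0] = p[0] + (float)x;
                out[nverts * 3 + 1] = p[1] + (float)y;
                out[nverts * 3 + 2] = p[2] + (float)z;
            }
        }
    }
    return nverts;
}

/* Counts first, then allocates and fills. Caller frees the result. */
float *build_chunk_vertex_array(const chunk_sketch *c, size_t *nverts) {
    size_t count = emit_chunk_vertices(c, NULL);
    float *buf = malloc((count ? count : 1) * 3 * sizeof *buf);
    if (!buf) return NULL;
    size_t written = emit_chunk_vertices(c, buf);
    assert(written == count);
    *nverts = count;
    return buf;
}
```

The returned array could then be uploaded once per chunk rebuild (for example with `glBufferData`) and drawn with a single `glDrawArrays(GL_TRIANGLES, 0, nverts)`, with no per-quad model matrix. For a completely solid 128x32x32 chunk this gives the same 18432 quads as before, i.e. 110592 vertices, but in one draw call.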

My answer in the duplicate (sadly deleted) QA:

How to best write a voxel engine in C with performance in mind

went one step further, representing your voxel space in a 3D texture where each texel represents a voxel, and ray tracing it in the fragment shader using just a single glDraw rendering a single QUAD covering the screen. It uses the same technique as Wolfenstein-like ray casting, just ported to 3D.
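The shader from the deleted answer is not reproduced here. As a rough illustration of what "Wolfenstein-like ray cast ported to 3D" means, here is a CPU-side sketch of the standard grid-stepping (DDA) traversal, which is a swapped-in textbook technique and not necessarily the exact code from that answer. `voxel_grid`, `ray_hit`, `cast_ray` and `GRID` are hypothetical names, the array stands in for the 3D texture, and the ray origin is assumed to lie inside the grid:

```c
#include <assert.h>
#include <math.h>    /* INFINITY */
#include <stdint.h>

#define GRID 16  /* assumed cubic grid size, for illustration only */

typedef struct { uint8_t v[GRID][GRID][GRID]; } voxel_grid; /* stands in for the 3D texture */
typedef struct { int hit; int x, y, z; float t; } ray_hit;

/* floor() for floats without depending on libm */
static int floor_int(float f) {
    int i = (int)f;
    return ((float)i > f) ? i - 1 : i;
}

/* Steps from voxel to voxel along the ray until a non-zero voxel is found,
 * the ray leaves the grid, or the travelled distance exceeds max_t. */
ray_hit cast_ray(const voxel_grid *g, const float origin[3],
                 const float dir[3], float max_t)
{
    assert(g && origin && dir);
    ray_hit miss = {0, 0, 0, 0, 0.0f};
    int cell[3], step[3];
    float t_next[3];   /* ray parameter at the next boundary on each axis */
    float t_delta[3];  /* ray parameter needed to cross one voxel on each axis */

    for (int i = 0; i < 3; i++) {
        cell[i] = floor_int(origin[i]);
        if (dir[i] > 0.0f) {
            step[i] = 1;
            t_delta[i] = 1.0f / dir[i];
            t_next[i] = ((float)(cell[i] + 1) - origin[i]) / dir[i];
        } else if (dir[i] < 0.0f) {
            step[i] = -1;
            t_delta[i] = -1.0f / dir[i];
            t_next[i] = ((float)cell[i] - origin[i]) / dir[i];
        } else {
            step[i] = 0;            /* ray never crosses this axis */
            t_delta[i] = INFINITY;
            t_next[i] = INFINITY;
        }
    }

    float t = 0.0f;
    while (t <= max_t) {
        if (cell[0] < 0 || cell[0] >= GRID || cell[1] < 0 || cell[1] >= GRID ||
            cell[2] < 0 || cell[2] >= GRID)
            return miss;            /* left the voxel space */
        if (g->v[cell[0]][cell[1]][cell[2]]) {
            ray_hit h = {1, cell[0], cell[1], cell[2], t};
            return h;
        }
        /* advance across the nearest voxel boundary */
        int a = 0;
        if (t_next[1] < t_next[a]) a = 1;
        if (t_next[2] < t_next[a]) a = 2;
        t = t_next[a];
        t_next[a] += t_delta[a];
        cell[a] += step[a];
    }
    return miss;
}
```

In a fragment-shader version each pixel would run a loop like this with a texture lookup in place of the array read, which is why the whole scene needs only one draw call.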

The ray tracing (vertex shader casts the start rays) stuff was ported from this:

raytrace through 3D mesh

Here is a preview from the deleted QA:

preview

IIRC it was 128x128x128 or 256x256x256 voxels rendered in 12.4ms (ignore the fps, it was measuring something else). There was a lot of room to optimize more in the shaders, as I wanted to keep them as simple and understandable as I could (so no more advanced optimizations)...

There are also other options like using point sprites, or geometry shader emitting the cubes etc ...

In case lowering the number of glDraws is not enough of a speed boost, you might want to implement BVH structures to speed up rendering ... however for a single 128x32x32 space I see no point in this, as that should be handled with ease...

edited Mar 05 '21 at 06:45

answered Mar 05 '21 at 06:31

Spektre


Ok, I did what you said, and it's all working. However, it's still slower to render one big VBO than the thousands I was rendering individually. I don't know whether this is related to the initial problem, but without more information, I can't tell. If you need some source code for specific things, just ask and I'll give it. – Nikhil Nayak Mar 05 '21 at 07:03
@NikhilNayak and is it faster than before? How many triangles are rendered now, how many were rendered before, and how many with the naive approach? Can you measure the actual time it spends drawing with each of the 3 methods? – Spektre Mar 05 '21 at 08:40
@NikhilNayak you can measure like this [Benchmarking GLSL shaders to compare speed of alternative implementations](https://stackoverflow.com/a/37490211/2521214) as measuring time on the CPU side will not measure what you need – Spektre Mar 05 '21 at 09:25
So there are two methods: Greedy meshing and just drawing everything. With just drawing everything, there are 6 VBOs (for 6 faces of a cube) that I draw for every cube in the chunk. With greedy meshing, I have 1 VBO with ~50k vertices that I only draw once. I'm working on doing benchmarks right now. – Nikhil Nayak Mar 05 '21 at 16:21
@NikhilNayak by the 3rd method I meant the old greedy meshing with many VBOs ... you should clearly see a time difference between the single VBO of your new greedy meshing and the original 18432 VBOs – Spektre Mar 05 '21 at 16:40
Ok, this is my main loop code: https://pastebin.com/WU0tjVjS and this is some of my chunk code: https://pastebin.com/Gwju2bAQ. This code is performing worse than the 18432 individually rendered quads. – Nikhil Nayak Mar 05 '21 at 17:47
@NikhilNayak I'm not using GLFW, but I think that would measure the interval your redraw is refreshed with and not the actual time of rendering. I would start measuring right before `render_chunk(&test_chunk, faces_vert, program, model, modelLoc);` and stop right after it... not sure if `glfwGetTime();` is measuring CPU side or GPU side times ... You would probably be better off with the OpenGL time measurement from the last link in my previous comments, or at least add a `glFinish()` before the end time measurement – Spektre Mar 05 '21 at 17:51
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229576/discussion-between-nikhil-nayak-and-spektre). – Nikhil Nayak Mar 05 '21 at 17:59
