
I created a voxel world with OpenGL ES 2.0, using a VBO to store a basic cube and a different position matrix for each cube. I get 30 fps on my Galaxy S3 when 500-600 cubes are being rendered, but with more than 1,500 cubes it cannot run faster than 8 fps. This is unacceptable, because the voxel world should be able to render more than 5,000 voxels at a stable 30 fps. I have played other mobile games on my phone that run at good frame rates and render far more than 5,000 blocks at a time. What techniques would be best for getting good performance?

Here is what I have set up in more detail: There is one VBO containing vertex information for a basic cube. Each block has its own matrix that is translated to the block's position in world space (This matrix is calculated only once when the block is created). The block calls glDrawArrays to draw the cube using its position matrix. Unfortunately this means there are thousands of calls to glDrawArrays in each frame.

Is there a better technique to this? I don't know how to group all the blocks into one single call to glDrawArrays because that would mean the VBO would need a huge allocation, to add all the vertex data for every single cube, and it is impossible to know how much space the VBO would need before drawing them. What I was thinking was to allocate a VBO for every 500 or so blocks so that if it needs more space for blocks it can always create a new VBO for it. And this way it wouldn't be allocating too much extra space since it will only allocate enough space for 500 blocks, and this way if we have 5000 blocks in the world, there will be only 10 calls to glDrawArrays instead of having thousands of those calls.

Another idea I have is that instead of having a VBO for the cube, I could make a VBO for a quad, and use a transformation matrix on each quad. This would require even more calls to glDrawArrays since I would have to call it for each face of the cube, but the plus side is that this way I can remove the faces that already have a block next to them. For the floor level, each block has 4 blocks surrounding it, so those 4 faces don't actually need to be drawn. This would save drawing those 4 quads for each block, but it would require more than double the amount of glDrawArrays calls. To reduce the amount of glDrawArrays calls I could create a new VBO for every 500 or so quads, and add/remove quads to the current VBOs whenever necessary. This would reduce the amount of glDrawArrays calls, but it would mean that I have to group each quad based on its texture, which is another issue because if I have to create a VBO for each texture, that would require me to allocate a lot of extra unnecessary space because there might be just one block that uses a certain texture and I may end up allocating space for 500 blocks for that texture.

These are my thoughts on some of the methods I can think of to optimise the rendering, but I don't think any of these techniques will drastically improve the fps of the game, because every method comes with its own issues. Is there anything that I have not thought of that could be a better solution?

EDIT: I switched to rendering quads instead of cubes, because this way I can skip the faces that are not visible. After that I also added frustum culling, so that only blocks inside the frustum are drawn. This improved performance enough that I can now render a decent-sized world at 30 fps. But I think there is still a lot of room for improvement, because there are currently 23,000 calls to glDrawArrays(GL_TRIANGLES) per frame (one for each quad rendered on screen). Would switching to glDrawArrays(GL_TRIANGLE_STRIP) make any real difference? Creating VBOs that hold 1,000 quads each instead of just 1 quad is also a possibility, but that would mean allocating a lot more space in the VBOs. (Right now there is only one quad stored in the VBO, which is transformed by a matrix to its position/rotation.)

Mohsin

3 Answers


If using octrees (which is definitely THE WAY) does not suit you, you can optimize the code that issues the VBO draw calls.

In my work, I started with a scene rendering at 3 fps. Just by optimizing the OpenGL calls and context switches, it now runs at 53 fps (which is quite fine considering the starting point).

So, try not to change any GPU register between calls:

  • Sort all objects that use the same shader together, so they can be rendered in one go with a single glUseProgram call.
  • Sort objects by transparency, so that translucent objects are drawn only at the end.
  • Draw objects so that each fragment is shaded only once: if one object is behind another, draw the front object first, because the depth test is faster than the fragment calculation.
  • Avoid shaders that use "discard;", which is costly for the GPU to process.
  • Use reversed loops to gain a little bit of CPU speed.
  • Do not bind a texture if it is already the one selected on the GPU (a CPU 'if' is less costly than a GPU register change).
  • Likewise, do not update the shader attributes if there is no need to (a CPU 'if' is less costly).
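The sorting and "skip the call if nothing changed" ideas in the list above can be sketched roughly as follows. This is only an illustration, not the answerer's code: `DrawItem`, `GlState` and `SortedRenderer` are hypothetical names, and `GlState` stands in for the real `GLES20` calls so that the sketch runs without a GL context.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical draw item: the ids stand in for GL program and texture handles.
final class DrawItem {
    final int programId;
    final int textureId;
    final boolean translucent;

    DrawItem(int programId, int textureId, boolean translucent) {
        this.programId = programId;
        this.textureId = textureId;
        this.translucent = translucent;
    }
}

// Hypothetical wrapper around the GL calls (assumption, not a real API).
interface GlState {
    void useProgram(int programId);  // would call GLES20.glUseProgram
    void bindTexture(int textureId); // would call GLES20.glBindTexture
    void draw(DrawItem item);        // would call GLES20.glDrawArrays
}

final class SortedRenderer {
    // Last state sent to the GPU; -1 means "nothing set yet".
    private int currentProgram = -1;
    private int currentTexture = -1;

    // Opaque items first, then grouped by program, then by texture.
    static final Comparator<DrawItem> ORDER =
        Comparator.comparing((DrawItem d) -> d.translucent)
                  .thenComparingInt(d -> d.programId)
                  .thenComparingInt(d -> d.textureId);

    void render(List<DrawItem> items, GlState gl) {
        List<DrawItem> sorted = new ArrayList<>(items);
        sorted.sort(ORDER);
        for (DrawItem item : sorted) {
            // A cheap CPU comparison avoids a redundant state change.
            if (item.programId != currentProgram) {
                gl.useProgram(item.programId);
                currentProgram = item.programId;
            }
            if (item.textureId != currentTexture) {
                gl.bindTexture(item.textureId);
                currentTexture = item.textureId;
            }
            gl.draw(item);
        }
    }
}
```

The cached ids are only valid as long as no other code changes the GL program or texture binding behind the renderer's back.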

If you post some pieces of code, I can help you better.

diego.martinez

I am currently implementing a voxel world using Java on a normal PC with OpenGL 4.x.

At the beginning I had the same issue, but then I followed a very basic tutorial: https://sites.google.com/site/letsmakeavoxelengine/

With one render call per chunk, there is no problem rendering 10 chunks of 32*32*32 blocks (FPS > 30). You should load the chunk and add only those faces that are not occluded by other faces (that is, the ones visible to the player) to an array, which is then uploaded to a VBO. That gives you one render call per chunk with the minimum number of faces.

In 2D it looks like this:

    _ _ _
   |B B B|
   |B B |
   |B B B|
    - - - 
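The face-removal idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tutorial's code: `ChunkMesher` and `buildMesh` are hypothetical names, the chunk is a plain `boolean[][][]` of solid flags, only positions are emitted (no normals or UVs), and blocks outside the chunk are treated as air.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkMesher {
    // Six face directions: +X, -X, +Y, -Y, +Z, -Z.
    private static final int[][] DIRS = {
        {1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}
    };

    // Four corner offsets per face, same order as DIRS,
    // counter-clockwise when seen from outside the cube.
    private static final int[][][] CORNERS = {
        {{1, 0, 0}, {1, 1, 0}, {1, 1, 1}, {1, 0, 1}}, // +X
        {{0, 0, 1}, {0, 1, 1}, {0, 1, 0}, {0, 0, 0}}, // -X
        {{0, 1, 0}, {0, 1, 1}, {1, 1, 1}, {1, 1, 0}}, // +Y
        {{0, 0, 1}, {0, 0, 0}, {1, 0, 0}, {1, 0, 1}}, // -Y
        {{1, 0, 1}, {1, 1, 1}, {0, 1, 1}, {0, 0, 1}}, // +Z
        {{0, 0, 0}, {0, 1, 0}, {1, 1, 0}, {1, 0, 0}}  // -Z
    };

    // Each quad is split into two triangles.
    private static final int[] TRIANGLES = {0, 1, 2, 0, 2, 3};

    // Returns x,y,z positions (3 floats per vertex) of all visible faces.
    static float[] buildMesh(boolean[][][] solid) {
        List<Float> out = new ArrayList<>();
        for (int x = 0; x < solid.length; x++) {
            for (int y = 0; y < solid[x].length; y++) {
                for (int z = 0; z < solid[x][y].length; z++) {
                    if (!solid[x][y][z]) continue; // air block, nothing to draw
                    for (int f = 0; f < 6; f++) {
                        int[] d = DIRS[f];
                        // A face touching another solid block is never visible.
                        if (isSolid(solid, x + d[0], y + d[1], z + d[2])) continue;
                        for (int i : TRIANGLES) {
                            int[] c = CORNERS[f][i];
                            out.add((float) (x + c[0]));
                            out.add((float) (y + c[1]));
                            out.add((float) (z + c[2]));
                        }
                    }
                }
            }
        }
        float[] vertices = new float[out.size()];
        for (int i = 0; i < vertices.length; i++) vertices[i] = out.get(i);
        return vertices;
    }

    // Positions outside the chunk count as air (assumption of this sketch).
    private static boolean isSolid(boolean[][][] s, int x, int y, int z) {
        return x >= 0 && x < s.length
            && y >= 0 && y < s[x].length
            && z >= 0 && z < s[x][y].length
            && s[x][y][z];
    }
}
```

Because the array is built before anything is uploaded, its length gives the exact size the chunk's VBO needs. The result could then be uploaded with `glBufferData` and drawn with a single `glDrawArrays(GL_TRIANGLES, 0, vertices.length / 3)` per chunk.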

There is no need to draw the faces between the outer faces. In addition you can use frustum culling: How to check if an object lies outside the clipping volume in OpenGL?

So you only need to make a render call for those chunks that are actually inside your frustum. Do not render chunks behind the camera: OpenGL would do a lot of calculations for all the vertices of the chunk, only for the chunk not to be visible, so why render it in the first place? This check can happen in your Java code.
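A chunk-level frustum test of the kind described above could look roughly like this. It is a sketch under stated assumptions: `FrustumCuller` is a hypothetical name, the combined projection*view matrix is assumed to be in OpenGL's column-major `float[16]` layout, and each chunk is represented by its axis-aligned bounding box.

```java
final class FrustumCuller {
    // Extracts six inward-facing planes (a, b, c, d) from a column-major
    // projection*view matrix. A point is inside a plane when
    // a*x + b*y + c*z + d >= 0.
    static float[][] extractPlanes(float[] m) {
        float[][] p = new float[6][4];
        for (int j = 0; j < 4; j++) {
            // Element j of matrix rows 0..3 (column-major storage).
            float r0 = m[j * 4], r1 = m[j * 4 + 1], r2 = m[j * 4 + 2], r3 = m[j * 4 + 3];
            p[0][j] = r3 + r0; // left
            p[1][j] = r3 - r0; // right
            p[2][j] = r3 + r1; // bottom
            p[3][j] = r3 - r1; // top
            p[4][j] = r3 + r2; // near
            p[5][j] = r3 - r2; // far
        }
        return p;
    }

    // Conservative test: returns false only if the box lies completely
    // outside at least one plane. Some boxes near the frustum corners
    // may be reported visible even though they are not.
    static boolean isBoxVisible(float[][] planes,
                                float minX, float minY, float minZ,
                                float maxX, float maxY, float maxZ) {
        for (float[] pl : planes) {
            // Pick the box corner that lies farthest along the plane normal.
            float x = pl[0] >= 0 ? maxX : minX;
            float y = pl[1] >= 0 ? maxY : minY;
            float z = pl[2] >= 0 ? maxZ : minZ;
            if (pl[0] * x + pl[1] * y + pl[2] * z + pl[3] < 0) {
                return false; // even the best corner is outside this plane
            }
        }
        return true;
    }
}
```

Per frame, the planes would be extracted once and each chunk's bounding box tested before issuing its draw call.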

A third optimization could be deferred shading: http://en.wikipedia.org/wiki/Deferred_shading

As far as I know, shading is processed before depth testing throws away the triangles/faces occluded by others, so you can speed up your shader with deferred shading, because you only shade the fragments that pass the depth test.

There are many more ways to optimize voxel rendering, but for me these are the most basic ones. The tutorial behind the first link isn't finished yet, but it shows a lot of ideas for optimizing voxel rendering.

Edit: If you want to use textures, with different textures for each cube, I recommend placing all textures in one big texture so that you do not need to swap textures. A simple texture lookup is much faster than swapping a texture (glBindTexture(..)), doing the lookup, and later swapping back. Use one big texture and apply the right UV coordinates to your vertices.
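Computing the UV coordinates for one tile of such a combined texture is simple arithmetic. A minimal sketch, assuming a square atlas of equally sized square tiles numbered row by row starting at UV origin (0, 0); `TextureAtlas` and `uvRect` are hypothetical names:

```java
final class TextureAtlas {
    private final int tilesPerRow;

    TextureAtlas(int tilesPerRow) {
        this.tilesPerRow = tilesPerRow;
    }

    // Returns {u0, v0, u1, v1} for the given tile index.
    float[] uvRect(int tileIndex) {
        float size = 1.0f / tilesPerRow; // tile width and height in UV space
        int col = tileIndex % tilesPerRow;
        int row = tileIndex / tilesPerRow;
        float u0 = col * size;
        float v0 = row * size;
        return new float[] {u0, v0, u0 + size, v0 + size};
    }
}
```

For example, in a 4x4 atlas tile 5 sits in column 1, row 1, so its rectangle is (0.25, 0.25) to (0.5, 0.5). These four values would be written into the UV attributes of a face's vertices when the chunk mesh is built.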

glethien
  • But ... You're doing it on the PC, that's different than doing it on a mobile device. – user1095108 Aug 04 '14 at 13:28
  • I do not think so. I've used OpenGL ES 2.0 on different Android devices and the techniques used are the same. I did not implement voxel rendering on Android, but since you have much more limited resources you should really consider optimizations, and those are nearly the same as on any other device: reduce render calls and use as few vertices as possible. Deferred shading is no problem on Android, huge textures are no problem, only the CPU time used for creating the chunk can be a problem. But why shouldn't the techniques mentioned above work on Android? I see no reason. – glethien Aug 04 '14 at 14:39
  • Yeah, the problem only shows up on the mobile device. I ran the exact same code (with a few slight modifications, of course) on a PC and had no problem running it at 60 fps with over 30,000 blocks being rendered onto the screen. On the Android device, I cannot even get more than 1,000 blocks rendered at a good frame rate. – Mohsin Aug 04 '14 at 18:47
  • So if I do what you said and create a VBO for each chunk, then how much space would I allocate for it? If I allocate enough space for every block in that chunk, there would be a lot of extra space being allocated since there are a lot of air blocks in each chunk. Is there a way to dynamically increase/decrease the capacity of a VBO without losing performance? – Mohsin Aug 04 '14 at 18:49
  • On the graphics device it will be #Blocks*3*6*6*2*8 / 1024 / 1024 MB (this is the worst case, where you render every block including the occluded ones; normally it will be much smaller). In Java you need 1 byte per block, so with 32^3 = 32,768 blocks per chunk that is about 32 KB of RAM per chunk. – glethien Aug 05 '14 at 12:41

You should use an octree to discard big blocks of offscreen cubes. You divide the world into 8 "space cubes", splitting it along each axis. Then you check whether the camera can see anything inside each cube; if it can't, you discard all the blocks in that section (which can speed things up by up to 8x). Then, inside each remaining cube, you divide again into 8 sections and check again whether they are visible. And so on, speeding up both the checks and the rendering.
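The recursive subdivision described above can be sketched as follows. This is a minimal illustration, not a full implementation: `OctreeNode` and `VisibilityTest` are hypothetical names, the tree is built eagerly to a fixed depth over a cubic region, and the visibility check (for example a frustum test) is passed in from outside.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback: decides whether an axis-aligned cube can be seen.
interface VisibilityTest {
    boolean isBoxVisible(float minX, float minY, float minZ, float size);
}

final class OctreeNode {
    final float x, y, z, size;   // minimum corner and edge length
    final OctreeNode[] children; // null for a leaf

    OctreeNode(float x, float y, float z, float size, int depth) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.size = size;
        if (depth == 0) {
            children = null;
        } else {
            children = new OctreeNode[8];
            float h = size / 2;
            for (int i = 0; i < 8; i++) {
                // Bits 0, 1 and 2 of i select the low or high half on x, y and z.
                children[i] = new OctreeNode(
                    x + ((i & 1) != 0 ? h : 0),
                    y + ((i & 2) != 0 ? h : 0),
                    z + ((i & 4) != 0 ? h : 0),
                    h, depth - 1);
            }
        }
    }

    // Collects all visible leaves. An invisible node is rejected together
    // with its whole subtree after a single test.
    void collectVisibleLeaves(VisibilityTest test, List<OctreeNode> out) {
        if (!test.isBoxVisible(x, y, z, size)) return;
        if (children == null) {
            out.add(this);
            return;
        }
        for (OctreeNode child : children) {
            child.collectVisibleLeaves(test, out);
        }
    }
}
```

In a voxel world each leaf would typically correspond to one chunk, so the collected leaves are exactly the chunks to draw that frame.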

http://en.wikipedia.org/wiki/Octree

http://i.ytimg.com/vi/S-oIeUiw2UY/hqdefault.jpg

An octree can be accelerated using "portals" (and I don't mean GLaDOS ;) ), which discard voxels and octree nodes depending on what is visible through doors and windows, but this is only good for interiors.

diego.martinez