
I am making an application for real-time mesh deformation and I need to recalculate normals a huge number of times. Profiling shows that this bit of code takes the largest share of CPU time, so how can I optimize it?

void Mesh::RecalculateNormals()
{
        for (int i = 0; i < indexCount; i += 3)
        {
            const int ia = indices[i];
            const int ib = indices[i + 1];
            const int ic = indices[i + 2];

            const glm::vec3 e1 = glm::vec3(vert[ia].position) - glm::vec3(vert[ib].position);
            const glm::vec3 e2 = glm::vec3(vert[ic].position) - glm::vec3(vert[ib].position);
            const glm::vec3 no = cross(e1, e2);

            vert[ia].normal += glm::vec4(no, 0.0);
            vert[ib].normal += glm::vec4(no, 0.0);
            vert[ic].normal += glm::vec4(no, 0.0);
        }

        for (int i = 0; i < vertexCount; i++)
            vert[i].normal = glm::vec4(glm::normalize(glm::vec3(vert[i].normal)), 0.0f);

}

Also, before this function is called I must loop through all vertices and clear the previous normals by setting them to vec3(0).

How can I speed this up? Could there be a better algorithm? Or is GLM slow?

Jaysmito Mukherjee
2 Answers


There are three main ways to optimize this code: using SIMD instructions, using multiple threads, and improving the memory access pattern. None of them is a silver bullet.

Using SIMD instructions is not trivial in your case because of the index-based, data-dependent reads in the first loop. That being said, recent SIMD instruction sets like AVX2/AVX-512 on x86-64 processors and SVE on ARM ones provide instructions to perform gather loads. Once the values are loaded into SIMD registers, you can compute the cross product very quickly. The problem is that the index-based, data-dependent stores at the end of the first loop require scatter-store instructions, which are only available on very recent processors supporting AVX-512 (x86-64) or SVE (ARM). You can extract the values from SIMD registers so as to store them serially, but this will certainly be quite inefficient. Fortunately, the second loop is much easier to vectorize, but you will likely need to reorder the normal data into a more SIMD-friendly layout (see AoS vs SoA and data-oriented design). In the end, I expect a SIMD implementation of the first loop to be not much faster as long as scatter instructions are not used, and the second loop to be significantly faster. Actually, I expect the SIMD implementation of the first loop not to be a lot faster even with gather/scatter instructions, since such instructions tend to be implemented quite inefficiently and since I expect the first loop of your code to cause a lot of cache misses (see below).
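
For the second loop, assuming the normals are moved to an SoA layout (separate x/y/z arrays, which is not how the question's `vert` structure is laid out), the normalization can be vectorized with plain SSE intrinsics. A rough sketch, with illustrative names:

```cpp
#include <immintrin.h>
#include <cmath>

// Sketch: normalize normals stored in SoA form (separate x/y/z arrays),
// four at a time with SSE. The question's code uses an AoS vec4 layout,
// so this assumes the data has been reorganized first.
void NormalizeSoA(float* nx, float* ny, float* nz, int count)
{
    int i = 0;
    for (; i + 4 <= count; i += 4)
    {
        __m128 x = _mm_loadu_ps(nx + i);
        __m128 y = _mm_loadu_ps(ny + i);
        __m128 z = _mm_loadu_ps(nz + i);
        // squared length, then 1/length via full-precision sqrt + div
        __m128 len2 = _mm_add_ps(_mm_mul_ps(x, x),
                      _mm_add_ps(_mm_mul_ps(y, y), _mm_mul_ps(z, z)));
        __m128 inv = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(len2));
        _mm_storeu_ps(nx + i, _mm_mul_ps(x, inv));
        _mm_storeu_ps(ny + i, _mm_mul_ps(y, inv));
        _mm_storeu_ps(nz + i, _mm_mul_ps(z, inv));
    }
    for (; i < count; i++)  // scalar tail for the last <4 elements
    {
        float inv = 1.0f / std::sqrt(nx[i]*nx[i] + ny[i]*ny[i] + nz[i]*nz[i]);
        nx[i] *= inv; ny[i] *= inv; nz[i] *= inv;
    }
}
```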

Using multiple threads is also not trivial. Indeed, while the edge/cross-product part of the first loop can be trivially parallelized, the accumulation of the normals cannot. One solution is to use atomic adds. Another solution is to use a per-thread array of normal vectors. A faster solution is to use partitioning methods so as to split the mesh into independent parts (each thread can then safely perform its accumulation, except for possibly-shared vert items). Note that partitioning a mesh efficiently so as to balance the work between many threads is far from simple, and AFAIK it has been the subject of many past research papers. The best solution depends strongly on the number of threads, the size of the overall vert buffer in memory, and your performance/complexity requirements. The second loop can be trivially parallelized. The simplest way to parallelize the loops is OpenMP (a few pragmas are enough to parallelize them, although an efficient implementation can be quite complex).
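
The per-thread-array approach can be sketched with `std::thread` over flat illustrative buffers (the names and layout here are assumptions, not the question's actual `Mesh` members): each thread accumulates into a private buffer, and a final pass merges them, so no atomics are needed.

```cpp
#include <thread>
#include <vector>

// Sketch: per-thread accumulation buffers sidestep the write conflict on
// shared vertices. Positions/normals are flat x,y,z triples; all names
// are illustrative.
void AccumulateNormalsMT(const std::vector<float>& pos,
                         const std::vector<int>& indices,
                         std::vector<float>& normals,  // zero-filled, same size as pos
                         int threadCount)
{
    const int triCount = (int)indices.size() / 3;
    // One private buffer per thread, merged at the end.
    std::vector<std::vector<float>> partial(
        threadCount, std::vector<float>(normals.size(), 0.0f));

    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; t++)
    {
        workers.emplace_back([&, t]() {
            std::vector<float>& out = partial[t];
            for (int tri = t; tri < triCount; tri += threadCount)
            {
                const int ia = indices[tri*3], ib = indices[tri*3+1], ic = indices[tri*3+2];
                const float e1x = pos[ia*3+0] - pos[ib*3+0];
                const float e1y = pos[ia*3+1] - pos[ib*3+1];
                const float e1z = pos[ia*3+2] - pos[ib*3+2];
                const float e2x = pos[ic*3+0] - pos[ib*3+0];
                const float e2y = pos[ic*3+1] - pos[ib*3+1];
                const float e2z = pos[ic*3+2] - pos[ib*3+2];
                const float cx = e1y*e2z - e1z*e2y;  // cross(e1, e2)
                const float cy = e1z*e2x - e1x*e2z;
                const float cz = e1x*e2y - e1y*e2x;
                out[ia*3+0] += cx; out[ia*3+1] += cy; out[ia*3+2] += cz;
                out[ib*3+0] += cx; out[ib*3+1] += cy; out[ib*3+2] += cz;
                out[ic*3+0] += cx; out[ic*3+1] += cy; out[ic*3+2] += cz;
            }
        });
    }
    for (std::thread& w : workers) w.join();

    // Merge pass (itself trivially parallelizable).
    for (int t = 0; t < threadCount; t++)
        for (int i = 0; i < (int)normals.size(); i++)
            normals[i] += partial[t][i];
}
```

Striding triangles across threads is the simplest work split; the mesh-partitioning schemes mentioned above would replace it for better locality.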

Regarding the input data, the first loop can be very inefficient. Indeed, ia, ib and ic can refer to very different indices, causing unpredictable vert[...] loads/stores in memory. If the structure is big, these loads/stores cause cache misses. Such cache misses can be very slow if the data structure does not fit in the caches, due to the much higher latency of RAM. The best solution to this problem is to improve the locality of the memory accesses. Reordering the vert items and/or the indices can help a lot to improve locality and thus performance. Again, partitioning methods can be used to do that. A naive first step is to sort vert so that ia is sorted, while updating ib and ic so the indices stay valid; this can be computed with a key-value arg-sort. If the mesh is dynamic, the problem becomes very complex to solve efficiently. Octrees and k-d trees could help to improve the locality of the computation without introducing (too) big an overhead.
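
One simple form of such a reordering places vertices in the order the index buffer first touches them and remaps the indices accordingly, so consecutive triangles tend to touch nearby memory. A sketch (the `Vertex` struct is a stand-in for the question's vertex type):

```cpp
#include <vector>

struct Vertex { float px, py, pz; };  // stand-in for the real vertex struct

// Reorder vertices by first use in the index buffer and remap the indices,
// improving the locality of the vert[...] accesses in the first loop.
void ReorderByFirstUse(std::vector<Vertex>& verts, std::vector<int>& indices)
{
    std::vector<int> remap(verts.size(), -1);
    std::vector<Vertex> reordered;
    reordered.reserve(verts.size());

    for (int& idx : indices)
    {
        if (remap[idx] < 0)                       // first time this vertex is seen
        {
            remap[idx] = (int)reordered.size();
            reordered.push_back(verts[idx]);
        }
        idx = remap[idx];                         // point at the new position
    }
    // Keep vertices never referenced by any triangle so the array stays complete.
    for (int v = 0; v < (int)verts.size(); v++)
        if (remap[v] < 0)
            reordered.push_back(verts[v]);
    verts.swap(reordered);
}
```

This is a one-off preprocessing pass; for a deforming mesh whose topology is fixed, it only needs to run once.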

Note that, depending on the previous operations, you may not need to recompute all the normals. If so, you can track the ones that need to be recomputed and only perform an incremental update.
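
A minimal sketch of such tracking, assuming the deformation step knows which vertices it moved (all names here are illustrative): a vertex-to-triangle adjacency table maps moved vertices to the triangles whose normals must be rebuilt.

```cpp
#include <vector>
#include <unordered_set>

// Build a vertex -> triangle adjacency table once (topology is fixed).
std::vector<std::vector<int>> BuildAdjacency(const std::vector<int>& indices,
                                             int vertexCount)
{
    std::vector<std::vector<int>> trianglesOfVertex(vertexCount);
    for (int i = 0; i < (int)indices.size(); i++)
        trianglesOfVertex[indices[i]].push_back(i / 3);
    return trianglesOfVertex;
}

// Given the vertices moved by the last deformation, return the set of
// triangles whose face normals need recomputation.
std::unordered_set<int> DirtyTriangles(const std::vector<std::vector<int>>& adj,
                                       const std::vector<int>& movedVertices)
{
    std::unordered_set<int> dirty;
    for (int v : movedVertices)
        for (int tri : adj[v])
            dirty.insert(tri);
    return dirty;
}
```

The accumulation loop then iterates only over the dirty triangles (after subtracting or re-clearing their previous contributions), instead of the whole index buffer.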

Finally, please check that compiler optimizations are enabled. Moreover, note that you can use the (unsafe) fast-math flags so as to improve the auto-vectorization of your code. You should also check which SIMD instruction set is used and that the glm calls are inlined by the compiler.

Jérôme Richard

Though what Jérôme Richard suggested is way better, implementing all of that is too complex for me as of now. So I tried some basic optimizations, and the code is now about 5 to 6 times faster!

Here is the new code:

#define VEC3_SUB(a, b, out) out.x = a.x - b.x; \
                out.y = a.y - b.y; \
                out.z = a.z - b.z;

#define VEC3_ADD(a, b, out) out.x = a.x + b.x; \
                out.y = a.y + b.y; \
                out.z = a.z + b.z;

float inline __declspec (naked) __fastcall asm_sqrt(float n)
{
    _asm fld dword ptr [esp+4]
    _asm fsqrt
    _asm ret 4
} 

#define VEC3_NORMALIZE(v, out)  \
{               \
                                 \
    float tempLength = ( (v.x) * (v.x) ) + ( (v.y) * (v.y) ) + ( (v.z) * (v.z) ); \
    float invLength = 1.0f / asm_sqrt(tempLength); \
    out.x = (v.x) * invLength;              \
    out.y = (v.y) * invLength;              \
    out.z = (v.z) * invLength;              \
}

#define VEC3_CROSS(a, b, out) \
{                             \
    out.x = ( (a.y) * (b.z) ) - ( (a.z) * (b.y) ); \
    out.y = ( (a.z) * (b.x) ) - ( (a.x) * (b.z) ); \
    out.z = ( (a.x) * (b.y) ) - ( (a.y) * (b.x) ); \
}

void Mesh::RecalculateNormals()
{
        glm::vec3 e1;
        glm::vec3 e2;   
        glm::vec3 no;

        int iabc[3];
        for (int i = 0; i < indexCount; i += 3)
        {
            iabc[0] = indices[i];
            iabc[1] = indices[i + 1];
            iabc[2] = indices[i + 2];

            glm::vec4& tmp4a = vert[iabc[0]].position;
            glm::vec4& tmp4b = vert[iabc[1]].position;
            glm::vec4& tmp4c = vert[iabc[2]].position;
                        
            VEC3_SUB(tmp4a, tmp4b, e1);
            VEC3_SUB(tmp4c, tmp4b, e2);

            VEC3_CROSS(e1, e2, no);

            VEC3_ADD(vert[iabc[0]].normal, no, vert[iabc[0]].normal);
            VEC3_ADD(vert[iabc[1]].normal, no, vert[iabc[1]].normal);
            VEC3_ADD(vert[iabc[2]].normal, no, vert[iabc[2]].normal);
        }

        for (int i = 0; i < vertexCount; i++)
        {
            VEC3_NORMALIZE(vert[i].normal, vert[i].normal);
        }

}
Jaysmito Mukherjee
  • Interesting. So I guess the `glm` function calls were not inlined (or glm used a bad implementation) given the performance results. I think the `asm_sqrt` function is not needed, as the right compiler optimizations should produce this anyway. Note that since you are OK with a lower-precision sqrt, further optimizations are possible. First, the fsqrt instruction is obsolete and slow (it targets very old x87 processors); please use at least sqrtps, which can be up to twice as fast on new hardware (although exceptions are handled differently), e.g. on AMD Zen3 processors. – Jérôme Richard Feb 06 '22 at 15:25
  • Moreover, you could use an approximate reciprocal sqrt instruction (i.e. `1/sqrt(v)` with a little less precision) available in x86-64 hardware: the rsqrtps instruction. It has a throughput of 0.5~1 cycle compared to 4~22 cycles for fsqrt on modern processors. It should be about 4~5 times faster on a 5-year-old Intel processor and at least 20~44 times faster on a 5-year-old AMD one. Note that float/double values need to be loaded into xmm registers, and you can use C SIMD intrinsics rather than low-level assembly. This should be enough to speed up the overall code significantly ;) . – Jérôme Richard Feb 06 '22 at 15:35
  • @JérômeRichard Sadly I can no longer use inline assembly, as I just moved to x64 where inline assembly is not supported, so I moved to the [Quake3 fast inverse sqrt](https://github.com/Jaysmito101/TerraForge3D/blob/0f42550c7e0982b633e93f48360e7ea224f9ad44/TerraForge3D/src/Base/Mesh.cpp#L32). And yes, I think there are some problems with the glm math functions (I tried GLM_FORCE_INLINE and it was still slow), or I might have made an error there. As for SIMD, I am very new to it and am trying to avoid it for now, as I tried a SIMD implementation and it is not working – Jaysmito Mukherjee Feb 06 '22 at 15:51
  • SIMD version I tried: https://stackoverflow.com/questions/71005923/simd-code-for-mesh-normals-calculation-not-working-trying-to-convert-c-to-sim – Jaysmito Mukherjee Feb 06 '22 at 15:52
  • Also, one thing I wanted to mention: having all the glm math functions and constructors meant too many function calls, which I completely got rid of by using macros for the math. And the normal-clearing part I do on the GPU (OpenCL). (I cannot do the normal generation on the GPU because of the way the pipeline is designed as of now) – Jaysmito Mukherjee Feb 06 '22 at 15:54
  • This is surprising. Did you enable compiler optimisations? If you use GCC/Clang/ICC, then you should use `-O3` and possibly `-march=native` (faster performance but less portable code) and `-ffast-math` (faster performance assuming there are no inf/NaN values). For MSVC it should be `/O2` and `/fp:fast` (and possibly `/Ob2`). – Jérôme Richard Feb 06 '22 at 16:31
  • I did /O2i in MSVC – Jaysmito Mukherjee Feb 06 '22 at 18:12
  • There are /O2 and /Oi in MSVC, but there is no /O2i in the [documentation](https://learn.microsoft.com/en-us/cpp/build/reference/o1-o2-minimize-size-maximize-speed?view=msvc-170). Is this a mistake? – Jérôme Richard Feb 06 '22 at 18:15
  • You should be able to use functions instead of macros. – Ben Feb 07 '22 at 01:23
  • @JérômeRichard We can group flags like that; it's in the remarks part of the docs – Jaysmito Mukherjee Feb 07 '22 at 03:17
  • https://learn.microsoft.com/en-us/cpp/build/reference/o-options-optimize-code?view=msvc-170 – Jaysmito Mukherjee Feb 07 '22 at 03:17
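
For reference, the rsqrtps approach suggested in the comments can be sketched with SSE intrinsics; one Newton-Raphson step recovers most of the precision lost by the hardware estimate (a sketch under those suggestions, not code from the question):

```cpp
#include <immintrin.h>

// rsqrtps gives a ~12-bit 1/sqrt estimate; one Newton-Raphson step refines
// it to near full single precision, still far cheaper than sqrt + div.
__m128 FastInvSqrt(__m128 v)
{
    __m128 est = _mm_rsqrt_ps(v);                       // ~12-bit estimate of 1/sqrt(v)
    // Newton-Raphson refinement: est * (1.5 - 0.5 * v * est * est)
    __m128 half_v = _mm_mul_ps(_mm_set1_ps(0.5f), v);
    __m128 e2     = _mm_mul_ps(est, est);
    return _mm_mul_ps(est,
        _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(half_v, e2)));
}
```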