
I am doing some work with SDFs (signed distance fields), evaluating them both on the GPU and the CPU. Some of my SDFs are analytic, some are grid based. (An SDF is a function that, given a point in space, returns the signed distance to the nearest surface.)

A grid-based SDF is just a 3D array with a floating-point value in each cell.

I implemented dual contouring (DC) to create meshes out of these SDFs, but I have a huge optimization issue: at the same DC resolution, an analytic SDF takes a minute and a half, while a grid SDF takes hours.

The algorithm is exactly the same; only the SDF evaluation function is different, so the bottleneck is evaluating the grid SDF model (I verified this with a profiler).

I am trying to figure out how to reduce computation time for the following:

struct GridSdfMetaData
{
    float step_size;

    uint x_resolution;
    uint y_resolution;
    uint z_resolution;

    float x_origin;
    float y_origin;
    float z_origin;
};

inline float BilinearInterpolation(
    const std::array<float, 4> &values,
    const std::array<float, 2> &weights)
{
    const float m1 = std::lerp(values[0], values[1], weights[0]);
    const float m2 = std::lerp(values[2], values[3], weights[0]);

    return std::lerp(m1, m2, weights[1]);
}

inline float TrilinearInterpolation(
    const std::array<float, 8> &values,
    const std::array<float, 3> &weights)
{
    const float m1 = BilinearInterpolation({values[0], values[1], values[2], values[3]}, {weights[0], weights[1]});
    const float m2 = BilinearInterpolation({values[4], values[5], values[6], values[7]}, {weights[0], weights[1]});

    return std::lerp(m1, m2, weights[2]);
}

float SampleGridData(
    const Eigen::Vector3f &point,
    const GridSdfMetaData& grid_meta_data,
    const std::vector<float>& grid_data)
{
    using vec3 = Eigen::Vector3f;
    using ivec3 = Eigen::Vector3i;

    const vec3 relative_p =
        point - vec3(grid_meta_data.x_origin, grid_meta_data.y_origin, grid_meta_data.z_origin);
    const vec3 normalized_p = vec3(
        relative_p.x() / (grid_meta_data.step_size),
        relative_p.y() / (grid_meta_data.step_size),
        relative_p.z() / (grid_meta_data.step_size));

    ivec3 discrete_p(
        int(std::floor(normalized_p.x())),
        int(std::floor(normalized_p.y())),
        int(std::floor(normalized_p.z())));

    discrete_p.x() = std::clamp(discrete_p.x(), 0, int(grid_meta_data.x_resolution) - 1);
    discrete_p.y() = std::clamp(discrete_p.y(), 0, int(grid_meta_data.y_resolution) - 1);
    discrete_p.z() = std::clamp(discrete_p.z(), 0, int(grid_meta_data.z_resolution) - 1);

    float v000 = ((float*)(grid_data.data()))[ //
        discrete_p.z() * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
        discrete_p.y() * grid_meta_data.x_resolution +
        discrete_p.x()];

    float v001 = ((float*)(grid_data.data()))[ //
        discrete_p.z() * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
        discrete_p.y() * grid_meta_data.x_resolution +
        std::min(uint(discrete_p.x()) + 1, grid_meta_data.x_resolution - 1)];

    float v010 = ((float*)(grid_data.data()))[ //
        discrete_p.z() * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
        std::min(uint(discrete_p.y()) + 1, grid_meta_data.y_resolution - 1) * grid_meta_data.x_resolution +
        discrete_p.x()];

    float v011 = ((float*)(grid_data.data()))[ //
        discrete_p.z() * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
        std::min(uint(discrete_p.y()) + 1, grid_meta_data.y_resolution - 1) * grid_meta_data.x_resolution +
        std::min(uint(discrete_p.x()) + 1, grid_meta_data.x_resolution - 1)];

    float v100 = ((float*)(grid_data.data()))[ //
        std::min(uint(discrete_p.z()) + 1, grid_meta_data.z_resolution - 1) * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
        discrete_p.y() * grid_meta_data.x_resolution +
        discrete_p.x()];

    float v101 = ((float*)(grid_data.data()))[ //
        std::min(uint(discrete_p.z()) + 1, grid_meta_data.z_resolution - 1) * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
        discrete_p.y() * grid_meta_data.x_resolution +
        std::min(uint(discrete_p.x()) + 1, grid_meta_data.x_resolution - 1)];

    float v110 = ((float*)(grid_data.data()))[ //
        std::min(uint(discrete_p.z()) + 1, grid_meta_data.z_resolution - 1) * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
        std::min(uint(discrete_p.y()) + 1, grid_meta_data.y_resolution - 1) * grid_meta_data.x_resolution +
        discrete_p.x()];

    float v111 = ((float*)(grid_data.data()))[ //
        std::min(uint(discrete_p.z()) + 1, grid_meta_data.z_resolution - 1) * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
        std::min(uint(discrete_p.y()) + 1, grid_meta_data.y_resolution - 1) * grid_meta_data.x_resolution +
        std::min(uint(discrete_p.x()) + 1, grid_meta_data.x_resolution - 1)];

    const float dx = normalized_p.x() - std::floor(normalized_p.x());
    const float dy = normalized_p.y() - std::floor(normalized_p.y());
    const float dz = normalized_p.z() - std::floor(normalized_p.z());

    return TrilinearInterpolation({v000, v001, v010, v011, v100, v101, v110, v111}, {dx, dy, dz});
}

This function consumes the vast majority of the DC time for grid SDFs (about 90%), so any speedup here translates almost directly into total runtime.

Makogan
  • Since you are already including eigen in your project, why use a homebrew matrix for `grid_data`? Eigen does a bunch of optimizations under the hood. Leveraging them should be the first step here. –  May 20 '22 at 22:20
  • Inter compatibility with other tools. I guess I could try it, but I am not sure what meaningful optimizations eigen can do for a dense contiguous float array. – Makogan May 20 '22 at 22:36

1 Answer


Here are just a few things that stand out to me, without any testing ...

  1. You are mixing float and integer types a lot

    For example this:

    float v000 = ((float*)(grid_data.data()))[
     discrete_p.z() * grid_meta_data.y_resolution * grid_meta_data.x_resolution +
     discrete_p.y() * grid_meta_data.x_resolution +
     discrete_p.x()];
    

    means slow conversions between the types. I would use just integers; for example, take a look at my DDA.

    In case you also need the float value, you can keep a pair of float and int for each variable and evaluate them separately so you do not mix the types.
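    A minimal sketch of that float/int pairing (names are mine, not from the question): split each grid-space coordinate once into an integer cell index and a float fractional weight, so indexing stays purely integer and interpolation stays purely float.

    ```cpp
    #include <cmath>

    // Hypothetical helper: split a grid-space coordinate into its integer
    // cell index and fractional interpolation weight in one step, so the
    // rest of the sampler never converts between float and int again.
    inline void SplitCoord(float normalized, int resolution, int &cell, float &frac)
    {
        const float fl = std::floor(normalized);
        cell = int(fl);
        frac = normalized - fl; // in [0, 1)

        // Clamp to the valid cell range, mirroring the clamps in the question.
        if (cell < 0)              { cell = 0;              frac = 0.0f; }
        if (cell > resolution - 1) { cell = resolution - 1; frac = 1.0f; }
    }
    ```

    The sampler would then call this once per axis and never touch `floor` or casts again.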

  2. You are using a lot of unnecessary multiplications

    and doing them over and over again. My bet is that your discrete_p is incrementing, so you can just add instead of multiplying again. DDA is your friend...
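    As an illustration (all names here are hypothetical, not from the question): compute the strides and flat base index once, then derive the eight corner indices with additions only, matching the question's v000..v111 ordering.

    ```cpp
    #include <cstddef>

    // Hypothetical sketch for a grid of dimensions nx*ny*nz stored x-fastest:
    // compute the strides and flat base index once, then get the 8 cell-corner
    // indices with additions only (no per-corner multiplications).
    struct GridDims { int nx, ny, nz; };

    inline void CornerIndices(const GridDims &d, int x, int y, int z,
                              std::size_t out[8])
    {
        const std::size_t row   = std::size_t(d.nx);       // stride for +1 in y
        const std::size_t slice = row * std::size_t(d.ny); // stride for +1 in z

        // Offsets for the +1 neighbors, clamped to the edge as in the question.
        const std::size_t dx = (x + 1 < d.nx) ? 1 : 0;
        const std::size_t dy = (y + 1 < d.ny) ? row : 0;
        const std::size_t dz = (z + 1 < d.nz) ? slice : 0;

        const std::size_t base =
            std::size_t(z) * slice + std::size_t(y) * row + std::size_t(x);

        out[0] = base;                // v000
        out[1] = base + dx;           // v001
        out[2] = base + dy;           // v010
        out[3] = base + dy + dx;      // v011
        out[4] = base + dz;           // v100
        out[5] = base + dz + dx;      // v101
        out[6] = base + dz + dy;      // v110
        out[7] = base + dz + dy + dx; // v111
    }
    ```

    In a DC sweep where discrete_p only increments, even `base` can be carried along with `+= 1` / `+= row` / `+= slice` instead of being recomputed.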

  3. Are you using direct array access or the slow, safe one with bounds checking?

    I do not use std::array myself so I am not sure, but the latter slows things down considerably, so find out and make sure you use direct (unchecked) access.
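    To make the distinction concrete (this is standard std::vector behavior, independent of the question's code): at() is the checked accessor that throws on out-of-range indices, while operator[] does no check.

    ```cpp
    #include <cstddef>
    #include <vector>

    // operator[] does no bounds check, so it is the fast "direct" access;
    // at() checks the index and throws std::out_of_range on failure.
    inline float SampleUnchecked(const std::vector<float> &grid, std::size_t i)
    {
        return grid[i];       // unchecked
        // return grid.at(i); // checked: measurably slower in a hot loop
    }
    ```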

  4. heap/stack thrashing

    When I look at this:

    float SampleGridData(
     const Eigen::Vector3f &point,
     const GridSdfMetaData& grid_meta_data,
     const std::vector<float>& grid_data)
    

    I get the impression it is called a huge number of times, usually in some loop. In such a case it is much faster to inline the function completely (or turn it into a macro), or to pass the operands and result as global variables or a struct (context).
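    A rough sketch of the struct/context idea (all names made up): bundle the invariants once outside the loop, so the hot path passes a single context instead of three references, and the division by step_size becomes a precomputed multiply.

    ```cpp
    #include <cstddef>

    // Hypothetical context: everything that is constant for one mesh
    // extraction is gathered once; the per-sample call only indexes.
    struct SdfSampleContext
    {
        const float *data; // raw pointer into the grid, fetched once
        int nx, ny, nz;
        float inv_step;    // 1/step_size, precomputed to avoid divisions
        float ox, oy, oz;  // grid origin

        inline float At(int x, int y, int z) const
        {
            return data[(std::size_t(z) * ny + y) * nx + x];
        }
    };
    ```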

  5. You are doing a lot of coordinate-wise computations

    I would create 3D inline functions or macros to do the work instead of invoking all the involved math three times, once per coordinate. For example:

     discrete_p.x() = clamp(int(discrete_p.x()), 0, int(grid_meta_data.x_resolution - 1));
     discrete_p.y() = clamp(int(discrete_p.y()), 0, int(grid_meta_data.y_resolution - 1));
     discrete_p.z() = clamp(int(discrete_p.z()), 0, int(grid_meta_data.z_resolution - 1));
    

    would be better as:

    clamp3D(discrete_p);
    

    And clamp3D should not call any subfunction ...
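    One possible shape for such a clamp3D (the signature is illustrative, using plain int references rather than an Eigen vector so there is no per-component call behind it):

    ```cpp
    // Hypothetical clamp3D: clamps all three components inline, calling no
    // subfunction, so the compiler can keep everything in registers.
    inline void clamp3D(int &x, int &y, int &z, int nx, int ny, int nz)
    {
        x = x < 0 ? 0 : (x > nx - 1 ? nx - 1 : x);
        y = y < 0 ? 0 : (y > ny - 1 ? ny - 1 : y);
        z = z < 0 ? 0 : (z > nz - 1 ? nz - 1 : z);
    }
    ```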

Spektre