
I am working with Kinect2Grabber and want to do some real-time processing, but the performance I get is around 1 fps. Ultimately I want to estimate the trajectory of a ball thrown in the air, and I cannot do that with such slow processing :(

I am using Windows 10 64-bit, Visual Studio 2017, and the PCL 1.9.1 All-in-one Installer (MSVC2017 x64) on an AMD Ryzen Threadripper 1900X 8-Core Processor. OpenMP is enabled in my project and so are optimizations. However, when I run my program, its CPU usage is around 12-13%. What am I doing wrong?

int main(int argc, char* argv[])
{
    boost::shared_ptr<visualization::PCLVisualizer> viewer(new visualization::PCLVisualizer("Point Cloud Viewer"));
    viewer->setCameraPosition(0.0, 0.0, -1.0, 0.0, 0.0, 0.0); 

    PointCloud<PointType>::Ptr cloud(new PointCloud<PointType>);

    // Retrieved Point Cloud Callback Function
    boost::mutex mutex;
    boost::function<void(const PointCloud<PointType>::ConstPtr&)> function = [&cloud, &mutex](const PointCloud<PointType>::ConstPtr& ptr) {
        boost::mutex::scoped_lock lock(mutex);

        //Point Cloud Processing 
        cloud = ptr->makeShared();
        std::vector<int> indices;
        removeNaNFromPointCloud<PointType>(*cloud, *cloud, indices);
        pass_filter(0.5, 0.90, cloud);
        outlier_removal(50, 1.0, cloud);
        downsampling_vox_grid(0.005f, cloud);
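        // cloud_normals is declared together with the omitted helper functions (not shown here)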
        normals(0.04, cloud, cloud_normals);
        segmentation(cloud, cloud_normals);
    };

    boost::shared_ptr<Grabber> grabber = boost::make_shared<Kinect2Grabber>();
    boost::signals2::connection connection = grabber->registerCallback(function);
    grabber->start();

    while (!viewer->wasStopped()) {
        // Update Viewer
        viewer->spinOnce();
        boost::mutex::scoped_try_lock lock(mutex);
        if (lock.owns_lock() && cloud) {
            // Update Point Cloud
            if (!viewer->updatePointCloud(cloud, "chmura")) {
                viewer->addPointCloud(cloud, "chmura");
            }
        }
    }

    grabber->stop();

    // Disconnect Callback Function
    if (connection.connected()) {
        connection.disconnect();
    }
    return 0;
}

The omitted code for pass_filter, outlier_removal, etc. is taken directly from the tutorials and it works, but it is very slow starting from outlier_removal (inclusive). Your help will be greatly appreciated.

I do not have to use Kinect2Grabber. Anything that can grab and process frames from the Kinect2 on Windows will be good.

Witek

1 Answer


I see a few mitigations for your issues. 12-13% usage on a Threadripper sounds about right: with 16 hardware threads, 100/16 is ~6.25% per thread, so 12-13% implies full use of one physical core (one thread for IO and one for computation? Just my guess).

To get better performance, you need to profile in order to understand what is causing the bottleneck. perf is a great tool for that, and there is an excellent video by Chandler Carruth on how to profile code. This is not the best place for a perf tutorial, but TL;DW:

  • compile using the -fno-omit-frame-pointer flag (or your compiler's equivalent)
  • perf record -g <executable>
  • perf report -g to see which functions take the most CPU cycles
  • perf report -g 'graph,0.5,caller' to see which call paths take the most CPU cycles
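
Since perf is Linux-only and you are on Windows, one simple alternative (a sketch of my own, not part of any profiler) is to time each stage of the callback with std::chrono so the slow stage stands out; the timed helper below is hypothetical:

#include <chrono>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical helper: runs one processing stage and prints how long it took.
template <typename F>
void timed(const std::string& name, F&& stage)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(stage)();
    const auto stop = std::chrono::steady_clock::now();
    std::cout << name << ": "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms" << std::endl;
}

// Inside the grabber callback, wrap the existing calls:
// timed("pass_filter",     [&] { pass_filter(0.5, 0.90, cloud); });
// timed("outlier_removal", [&] { outlier_removal(50, 1.0, cloud); });
// timed("voxel_grid",      [&] { downsampling_vox_grid(0.005f, cloud); });
// timed("normals",         [&] { normals(0.04, cloud, cloud_normals); });
// timed("segmentation",    [&] { segmentation(cloud, cloud_normals); });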

Most likely, the issues identified will be:

  • Repeated creation of single-use objects such as VoxelGrid: instantiating these objects just once is a better use of your CPU cycles (see the sketch after this list)
  • Locking and IO for the grabber
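
For the first point, a minimal sketch of what "instantiate once" could look like, keeping the parameters from the question (MeanK 50, StddevMulThresh 1.0, leaf size 0.005) and assuming PointType is the same alias used there; the FrameFilters struct is purely illustrative:

#include <pcl/filters/statistical_outlier_removal.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

using PointType = pcl::PointXYZRGBA;  // assumption: same alias as in the question

// Illustrative holder: configure the filters once, reuse them for every frame.
struct FrameFilters {
    pcl::StatisticalOutlierRemoval<PointType> sor;
    pcl::VoxelGrid<PointType> voxel;

    FrameFilters() {
        sor.setMeanK(50);                               // configured once...
        sor.setStddevMulThresh(1.0);
        voxel.setLeafSize(0.005f, 0.005f, 0.005f);
    }

    void apply(pcl::PointCloud<PointType>::Ptr& cloud) {
        pcl::PointCloud<PointType>::Ptr tmp(new pcl::PointCloud<PointType>);
        sor.setInputCloud(cloud);                       // ...only the input changes per frame
        sor.filter(*tmp);
        voxel.setInputCloud(tmp);
        voxel.filter(*cloud);                           // write the downsampled result back
    }
};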

Fixing these will gain you a slightly higher frame rate, but your CPU utilization will still be limited to 12-13%, i.e. the single-thread limit.


Since you are using a Threadripper, you can use threads to exploit the other cores and decouple IO, computation, and visualization:

  • One thread for the grabber, which grabs frames and pushes them into a queue
  • Another thread to consume frames from the queue as CPU becomes available; this saves the computed data into yet another queue
  • A visualization thread which takes data from the output queue

This allows you to tune the queue sizes and drop frames to improve latency (a sketch of such a queue follows below). It might not be required depending on the design of your custom Kinect2Grabber; that will be evident after you profile your code.

This has the potential to dramatically reduce latency as well as improve frame rates, perhaps increasing CPU utilization to 20% (because the grabber and visualization threads will be working at full throttle).
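
A minimal sketch of such a queue (the FrameQueue name and the drop-oldest policy are my own choices, not part of PCL or Kinect2Grabber): the grabber thread pushes, the processing thread pops, and a small capacity keeps latency bounded.

#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class FrameQueue {
public:
    explicit FrameQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T frame) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.size() == capacity_)
            queue_.pop_front();              // drop the oldest frame to bound latency
        queue_.push_back(std::move(frame));
        cv_.notify_one();
    }

    T pop() {                                // blocks until a frame is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T frame = std::move(queue_.front());
        queue_.pop_front();
        return frame;
    }

private:
    std::size_t capacity_;
    std::deque<T> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Usage idea: the grabber callback only does raw_frames.push(ptr->makeShared());
// a worker thread loops on raw_frames.pop(), runs the filters, and pushes the
// result into a second FrameQueue that the visualization loop drains.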


In order to fully utilize all the cores, the consumer thread can offload frames to other threads so the CPU processes multiple frames at once. For this, you can adopt the following options (they aren't exclusive; you can use option 2 as the workhorse for option 1 for the complete benefit):

  1. For executing multiple independent functions in parallel (e.g. your workhorse lambda), look at boost::thread_pool or boost::thread_group
  2. For a pipeline-based model (where every stage runs in a different thread), you can use frameworks like TaskFlow

Option 1 is suitable when you're not going for a specific metric like latency. It also requires the least change to your code, but it trades off between high latency (the design issue pointed out above) and keeping one copy of each object per thread to avoid the per-frame setup.

Option 2 is suitable when low latency is required and the frames need to stay ordered. However, in order to maintain low latency in TaskFlow, you would need to feed the frames at the rate your CPU cores can process them. Otherwise you can overload the CPU, leave no free RAM, cause page thrashing, and actually reduce performance.

Both of these options require you to ensure that the output arrives in the correct order. This can be done using promises/futures or a queue which drops out-of-order frames (see the sketch below). By implementing some or all of these solutions, you can ensure that your CPU utilization remains high, maybe even 100%.
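
One way to sketch option 1 with ordered output, assuming boost::asio::thread_pool as the concrete Boost pool (boost::thread_group would work similarly); RawFrame, ProcessedFrame, and process_frame are placeholders standing in for the point-cloud types and pipeline from the question:

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <future>
#include <queue>
#include <vector>

// Placeholders for the point-cloud type and the filtering pipeline.
using RawFrame = std::vector<float>;
using ProcessedFrame = std::vector<float>;

ProcessedFrame process_frame(RawFrame frame) {
    // ... filters, normals, segmentation would run here ...
    return frame;
}

int main() {
    boost::asio::thread_pool pool(6);                  // leave a couple of cores for grabber + viewer
    std::queue<std::future<ProcessedFrame>> in_flight; // FIFO of futures preserves frame order

    // Producer side: submit each grabbed frame to the pool.
    for (int i = 0; i < 10; ++i) {
        RawFrame frame(512, static_cast<float>(i));
        std::packaged_task<ProcessedFrame()> task(
            [f = std::move(frame)]() mutable { return process_frame(std::move(f)); });
        in_flight.push(task.get_future());
        boost::asio::post(pool, std::move(task));
    }

    // Consumer side: results come out in submission order,
    // no matter which worker finished first.
    while (!in_flight.empty()) {
        ProcessedFrame result = in_flight.front().get();
        in_flight.pop();
        (void)result;                                  // hand off to visualization here
    }

    pool.join();
    return 0;
}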


In order to know what fits your situation best, know what your target performance metrics are, profile the code intelligently, and test without bias.

  • Without metrics, you can fall into a deep hole of improving performance even when it is not required (e.g. 5 fps might be sufficient instead of 60 fps)
  • Without measuring, you can't know if you're getting the correct result or if you're indeed attacking the bottleneck.
  • Without unbiased testing, you can arrive at wrong conclusions.
Kunal Tyagi
  • Thank you for your suggestions. Too bad PCL is not internally parallelized (with some exceptions). I guess we will have to think of using a queue and processing incoming frames in separate threads. We should be able to process several frames before clogging the CPU and this hopefully should be enough. – Witek Jul 18 '19 at 21:18
  • What do you mean by internally parallelized? On a fundamental level, adding parallel for loops might do more harm than good without knowing the use case, though it might be possible to add a C++14-style executor to handle that in the future. – Kunal Tyagi Jul 19 '19 at 03:55
  • I agree that not everything can be easily parallelized, but many routines can, I think. Routines that iterate over the points of a cloud. Take removeNAN for example. There is a bunch of for loops, which could be easily executed as parallel for. Same case with passthrough filter, outlier removal, etc. Or am I wrong? – Witek Jul 19 '19 at 11:09