Question
What's the fastest open-source HOG extraction code for multicore CPUs?
Motivation
I'm working on a real-time object detection application. Specifically, I've developed a variant of Deformable Parts Model cascades, targeting 30fps object detection. I've reached a point where extracting HOG features is more expensive than the rest of my pipeline, combined. I'm using the [Felzenzwalb, Girshick, et al] parameters for HOG extraction. That is, a multiresolution pyramid of HOG descriptors, and each descriptor has a total of 32 bins for orientation and a few other cues.
Goals
I'd like to do multiscale HOG feature extraction at 60fps (16ms) for 640x480 images on a multicore CPU.
Related Work
I've benchmarked a few off-the-shelf multiscale HOG implementations on a 6-core Intel 3930k CPU. For a 640x480 image, I observe the following performance numbers:
- HOG in Dubout's FFLD DPM code: 19fps (52ms) -- C++ with OpenMP, but no vectorization
- HOG in voc-release5 DPM code: 2.4fps (410ms) -- singlethreaded C++, plus a Matlab wrapper
I've also experimented with the OpenCV HOG extraction code. The OpenCV version works, but it seems to be hard-coded for Dalal-Triggs' HOG setup, and OpenCV doesn't seem to allow me to use the same HOG parameters (normalization scheme, binary position features, etc) as [Felzenzwalb, Girshick, et al]. The OpenCV version also doesn't natively support multiscale HOG, though you could do the downsampling yourself and call OpenCV HOG for each scale. I don't remember what the OpenCV HOG performance looked like.
Final Thoughts
- The fastest HOG implementation -- FFLD -- seems to leave a lot of performance on the table. I haven't done a GFLOP/s estimate, but I do notice that FFLD's HOG code doesn't use any SSE/AVX vectorization. There isn't that much control flow, so vectorization seems like a cheap speedup opportunity here.
- I haven't mentioned GPU HOG implementations here. I've experimented with groundHOG/CUHOG and fasthog. The CUHOG authors claim 20fps (50ms) HOG extraction on an NVIDIA GTX560. But, Intel CPUs are the target platform for my application, and copying a full HOG pyramid from the GPU to CPU is prohibitively expensive.