Fastest HOG Feature Extraction implementation?

Question
What's the fastest open-source HOG extraction code for multicore CPUs?

Motivation
I'm working on a real-time object detection application. Specifically, I've developed a variant of Deformable Parts Model cascades, targeting 30fps object detection. I've reached a point where extracting HOG features is more expensive than the rest of my pipeline combined. I'm using the [Felzenszwalb, Girshick, et al] parameters for HOG extraction: a multiresolution pyramid of HOG descriptors, where each descriptor has a total of 32 bins covering orientation and a few other cues.
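For readers unfamiliar with that descriptor, here is a minimal sketch of how its 32 dimensions are commonly broken down in voc-release style feature code, together with the orientation-snapping step. The constant names and `snapOrientation` are hypothetical, and the 18/9/4/1 split reflects my understanding of that code rather than anything measured here.

```cpp
#include <cmath>

// Assumed per-cell descriptor layout (voc-release style); names are illustrative.
constexpr int kContrastSensitive = 18;   // signed orientations over 0..360 degrees
constexpr int kContrastInsensitive = 9;  // unsigned orientations over 0..180 degrees
constexpr int kTexture = 4;              // gradient-energy features, one per normalization
constexpr int kTruncation = 1;           // binary feature marking padded cells
constexpr int kDescriptorDim =
    kContrastSensitive + kContrastInsensitive + kTexture + kTruncation;
static_assert(kDescriptorDim == 32, "descriptor is expected to have 32 dimensions");

// Snaps a gradient (dx, dy) to one of 18 signed orientation bins by taking the
// largest absolute dot product against 9 unit vectors spaced 20 degrees apart.
// Bins 0..8 point along the unit vectors, bins 9..17 point the opposite way.
int snapOrientation(float dx, float dy) {
    const double kPi = 3.14159265358979323846;
    int best = 0;
    double bestDot = 0.0;
    for (int o = 0; o < 9; ++o) {
        const double angle = o * kPi / 9.0;
        const double dot = std::cos(angle) * dx + std::sin(angle) * dy;
        if (dot > bestDot) {
            bestDot = dot;
            best = o;
        } else if (-dot > bestDot) {
            bestDot = -dot;
            best = o + 9;
        }
    }
    return best;
}
```

The dot-product form avoids a per-pixel `atan2`, which is one reason this step is comparatively cheap.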

Goals
I'd like to do multiscale HOG feature extraction at 60fps (16ms) for 640x480 images on a multicore CPU.

Related Work
I've benchmarked a few off-the-shelf multiscale HOG implementations on a 6-core Intel 3930k CPU. For a 640x480 image, I observe the following performance numbers:

HOG in Dubout's FFLD DPM code: 19fps (52ms) -- C++ with OpenMP, but no vectorization
HOG in voc-release5 DPM code: 2.4fps (410ms) -- single-threaded C++, plus a Matlab wrapper

I've also experimented with the OpenCV HOG extraction code. The OpenCV version works, but it seems to be hard-coded for the Dalal-Triggs HOG setup, and it doesn't seem to let me use the same HOG parameters (normalization scheme, binary position features, etc.) as [Felzenszwalb, Girshick, et al]. The OpenCV version also doesn't natively support multiscale HOG, though you could do the downsampling yourself and call OpenCV HOG for each scale. I don't remember what the OpenCV HOG performance looked like.
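Doing the downsampling yourself mostly comes down to choosing a scale schedule. Below is a minimal sketch, assuming the usual DPM convention of a fixed number of levels per octave. `PyramidLevel`, `pyramidLevels`, and the defaults (10 levels per octave, 8-pixel cells, at least 5 cells along the shorter side) are illustrative assumptions, not values taken from OpenCV or from any of the codebases above.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct PyramidLevel {
    double scale;  // resize factor relative to the input image
    int width;     // resized image width in pixels
    int height;    // resized image height in pixels
};

// Lists the image sizes of a HOG pyramid with `interval` levels per octave,
// i.e. level i is the input resized by 2^(-i / interval). Generation stops
// once the shorter side holds fewer than `minCells` cells of `cellSize` pixels.
std::vector<PyramidLevel> pyramidLevels(int width, int height,
                                        int interval = 10,
                                        int cellSize = 8,
                                        int minCells = 5) {
    std::vector<PyramidLevel> levels;
    for (int i = 0;; ++i) {
        const double scale = std::pow(2.0, -static_cast<double>(i) / interval);
        const int w = static_cast<int>(std::lround(width * scale));
        const int h = static_cast<int>(std::lround(height * scale));
        if (std::min(w, h) < minCells * cellSize) {
            break;  // too small to yield a useful grid of cells
        }
        levels.push_back({scale, w, h});
    }
    return levels;
}
```

Each returned level would then be resized and passed to whichever single-scale HOG extractor is in use. That part is omitted here because it depends on the library.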

Final Thoughts

The fastest HOG implementation -- FFLD -- seems to leave a lot of performance on the table. I haven't done a GFLOP/s estimate, but I do notice that FFLD's HOG code doesn't use any SSE/AVX vectorization. There isn't that much control flow, so vectorization seems like a cheap speedup opportunity here.
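To illustrate the kind of loop I mean, here is a rough sketch of computing gradient magnitudes four pixels at a time with SSE intrinsics. This is not FFLD's code. `gradientMagnitude` and the assumption that the gradients are already stored in separate `dx` and `dy` float arrays are mine, and a scalar path handles the tail as well as builds without SSE.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HOG_SKETCH_USE_SSE 1
#endif

// Computes mag[i] = sqrt(dx[i]^2 + dy[i]^2) for n pixels.
void gradientMagnitude(const float* dx, const float* dy, float* mag,
                       std::size_t n) {
    std::size_t i = 0;
#ifdef HOG_SKETCH_USE_SSE
    // Vector path: four pixels per iteration, unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        const __m128 gx = _mm_loadu_ps(dx + i);
        const __m128 gy = _mm_loadu_ps(dy + i);
        const __m128 sq = _mm_add_ps(_mm_mul_ps(gx, gx), _mm_mul_ps(gy, gy));
        _mm_storeu_ps(mag + i, _mm_sqrt_ps(sq));
    }
#endif
    // Scalar path: remaining pixels, or all pixels when SSE is unavailable.
    for (; i < n; ++i) {
        mag[i] = std::sqrt(dx[i] * dx[i] + dy[i] * dy[i]);
    }
}
```

Whether the full extraction speeds up by a similar factor is something I would have to measure, since the histogram accumulation step does not map onto SIMD as directly as this loop does.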
I haven't mentioned GPU HOG implementations here. I've experimented with groundHOG/CUHOG and fasthog. The CUHOG authors claim 20fps (50ms) HOG extraction on an NVIDIA GTX560. But, Intel CPUs are the target platform for my application, and copying a full HOG pyramid from the GPU to CPU is prohibitively expensive.

OpenCV includes Dalal's implementation of HOG in both CPU and GPU versions. They work pretty well in my opinion, and they can easily be used for object detection with OpenCV's CvSVM. — marcos.nieto, Oct 25 '13 at 17:10
The filter convolution is the most expensive part in DPM so how do you manage this part? — Mickey Shine, Jun 11 '14 at 14:35
@MickeyShine the usual stuff... massively quantizing the features, and doing cascades. I'm doing more deep learning and less HOG-based DPMs these days. But I reached a point where I could do the convolutions for a HOG-based 3-component, 8-part-per-component model in well under 50ms. — solvingPuzzles, Jun 11 '14 at 16:15
@solvingPuzzles the link where you point to FFLD has rotted. — Bedir Yilmaz, Feb 03 '15 at 23:35
Just adding a couple of updated links [ffld](https://github.com/fanxu/ffld) and [ffld2](https://github.com/fanxu/ffld2). Seems to have moved again — Jon, Mar 09 '15 at 08:29
@solvingPuzzles I think it's best to give the [GitHub](https://github.com/fanxu/ffld2) link. — Bedir Yilmaz, Apr 23 '15 at 10:29
I think this page could help you: http://www.geocities.ws/talh_davidc/ — SomethingSomething, Jan 07 '16 at 22:16

score 1 · Answer 1 · answered Nov 12 '13 at 08:17

Have a look at the following implementation: HoG SSE.

It does fit your time requirements. It is written in C and uses 128-bit SIMD instructions.

The code can also be customized further, depending on the normalization strategy and output type you need.

I would be glad to hear your feedback so that I can improve this code.

ivan_a

Interesting! I'll give this a try. Does it do multiscale extraction (a "HOG pyramid," as some people call it)? – solvingPuzzles Nov 26 '13 at 06:28
@solvingPuzzles, did the HoG SSE fit your time requirements? Which solution did you find? – Tin Feb 24 '14 at 11:32
@ivan_a could you please explain, how to use this code? I see that it uses only 16 bins and it is written that you can't change this? What does that mean? – Autonomous Aug 17 '14 at 09:28
