
I am working on developing an object classifier using 3 different features, i.e. SIFT, histogram, and edge.

However, these 3 features have vectors of different dimensions, e.g. SIFT = 128 dimensions, HIST = 256.

Now these features cannot be concatenated into one vector due to their different sizes. What I am planning to do, though I am not sure it is the correct way, is this:

For each feature I train a classifier separately, then I apply classification separately for the 3 different features, then count the votes and finally declare the image's class by majority vote.

Do you think this is a correct way?

rish

1 Answer


There are several ways to get classification results that take multiple features into account. What you have suggested is one possibility: instead of combining features, you train multiple classifiers and, through some protocol, arrive at a consensus between them. This typically falls under the field of ensemble methods. Try googling boosting and random forests for more details on how to combine classifiers.
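As a rough illustration of the voting scheme you described, here is a minimal sketch. It assumes each feature has already been reduced to one fixed-length vector per image (the arrays X_sift, X_hist, X_edge and the label array y are hypothetical names), and uses scikit-learn SVMs as stand-in classifiers:

    # Majority voting across per-feature classifiers (sketch).
    # X_sift, X_hist, X_edge: hypothetical (n_images x dim) arrays,
    # one row per image; y: shared integer class labels.
    import numpy as np
    from sklearn.svm import SVC

    classifiers = [SVC().fit(X, y) for X in (X_sift, X_hist, X_edge)]

    def predict_majority(test_feats):
        # test_feats: one test matrix per feature, same row order.
        votes = np.vstack([clf.predict(X)
                           for clf, X in zip(classifiers, test_feats)])
        # Majority label per column, i.e. per test image
        # (assumes non-negative integer labels for bincount).
        return np.array([np.bincount(col).argmax() for col in votes.T])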

However, it is not true that your feature vectors cannot be concatenated because they have different dimensions. You can still concatenate the features together into one long vector. E.g., joining your SIFT and HIST features together will give you a vector of 384 dimensions. Depending on the classifier you use, you will likely have to normalize the entries of the vector so that no one feature dominates simply because, by construction, it has larger values.
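For instance, a minimal sketch of such a concatenation with per-feature normalization (sift_vec and hist_vec are hypothetical names for one image's 128-d and 256-d vectors):

    # Concatenate two per-image feature vectors, L2-normalizing each
    # block first so neither dominates merely by having larger values.
    import numpy as np

    def l2_normalize(v, eps=1e-10):
        return v / (np.linalg.norm(v) + eps)

    combined = np.concatenate([l2_normalize(sift_vec),
                               l2_normalize(hist_vec)])  # 384-d vector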

EDIT in response to your comment: It appears that your histogram is a feature vector describing a characteristic of the entire object (e.g. color), whereas your SIFT descriptors are extracted at local interest keypoints of that object. Since the number of SIFT descriptors may vary from image to image, you cannot pass them directly to a typical classifier, as classifiers usually take in one feature vector per sample you wish to classify. In such cases, you will have to build a codebook (also called a visual dictionary) from the SIFT descriptors you have extracted from many images. You will then use this codebook to derive a SINGLE feature vector from the many SIFT descriptors extracted from each image. This is what is known as a "bag of visual words" (BOW) model. Now that you have a single vector that "summarizes" the SIFT descriptors, you can concatenate it with your histogram to form a bigger vector. This single vector now summarizes the ENTIRE image (or the object in the image).

For details on how to build the bag-of-words codebook and how to use it to derive a single feature vector from the many SIFT descriptors extracted from each image, look at this book (free for download from the author's website) http://programmingcomputervision.com/ under the chapter "Searching Images". It is actually a lot simpler than it sounds.

Roughly: run KMeans to cluster the SIFT descriptors from many images and take the centroids (each centroid is a vector called a "visual word") as the codebook. E.g., for K = 1000 you have a codebook of 1000 visual words.

Then, for each image, create a result vector the same size as K (in this case 1000), where each element corresponds to a visual word. For each SIFT descriptor extracted from the image, find its closest matching vector in the codebook and increment the count in the corresponding cell of the result vector. When you are done, this result vector essentially counts how often the different visual words appear in the image. Similar images will have similar counts for the same visual words, so this vector effectively represents your images.

You will also need to "normalize" this vector to make sure that images with different numbers of SIFT descriptors (and hence different total counts) are comparable. This can be as simple as dividing each entry by the total count in the vector, or a more sophisticated measure such as tf-idf, as described in the book.
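A minimal sketch of this recipe, assuming all_descriptors stacks the 128-d SIFT descriptors from many training images and image_descs holds the descriptors of a single image (both names are hypothetical):

    # Build a visual-word codebook with KMeans, then turn one image's
    # SIFT descriptors into a normalized K-dimensional count vector.
    import numpy as np
    from sklearn.cluster import KMeans

    K = 1000  # codebook size, i.e. number of visual words
    kmeans = KMeans(n_clusters=K).fit(all_descriptors)  # centroids = visual words

    def bow_vector(image_descs):
        words = kmeans.predict(image_descs)  # nearest visual word per descriptor
        counts = np.bincount(words, minlength=K).astype(float)
        return counts / counts.sum()  # simple normalization (tf-idf also possible)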

I believe the author also provides Python code on his website to accompany the book. Take a look and experiment with it if you are unsure.

More sophisticated methods for combining features include Multiple Kernel Learning (MKL). In this case, you compute a different kernel matrix for each feature. You then find the optimal weights to combine the kernel matrices and use the combined kernel matrix to train an SVM. You can find code for this in the Shogun Machine Learning Library.
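I won't reproduce Shogun's MKL here, but the following sketch shows the core idea with hand-fixed weights (real MKL learns them): compute one kernel matrix per feature, combine them, and feed the combined matrix to an SVM with a precomputed kernel. X_bow, X_hist, and y are hypothetical per-image feature matrices and labels:

    # Fixed-weight kernel combination (a simplified stand-in for MKL).
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    K1 = rbf_kernel(X_bow)    # one kernel matrix per feature
    K2 = rbf_kernel(X_hist)
    w = 0.5                   # MKL would learn this weight; fixed here
    K_combined = w * K1 + (1 - w) * K2  # convex combination is still a valid kernel

    svm = SVC(kernel='precomputed').fit(K_combined, y)
    # At test time, combine the test-vs-train kernels the same way:
    # K_test = w * rbf_kernel(X_bow_test, X_bow) + (1 - w) * rbf_kernel(X_hist_test, X_hist)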

lightalchemist
  • Thanks for the great answer to my question. The MKL method you mentioned looks interesting, as I was not aware of it. For the second method you mentioned, where we combine the features: I am using OpenCV, and for example one image using SIFT has a [128 x 34] feature matrix while the histogram is [256 x 1]. I tried to combine them but I couldn't. That is why I felt they cannot be combined. Did I do something wrong? Once again, thank you so much. – rish Jul 31 '13 at 10:32
  • I added details to my answer in response to your reply. Btw, even with MKL, you would likely need to reduce your many SIFT descriptors to a single vector to use it. Part of the reason is that different images will give you different numbers of SIFT descriptors. Unless you are matching the SIFT descriptors directly, a classifier typically cannot handle a different number of feature vectors for each sample (i.e. image). – lightalchemist Jul 31 '13 at 15:31
  • @lightalchemist, thanks a lot, that was a sweet description and it took me just a few hours to learn and implement. Thank you so much. I like the idea of MKL, but since I have no idea about it yet, I will try it out after I learn the idea and see how it goes. I am using SVM for classification after creating the BOW model, and it seems to work perfectly. You mentioned boosting and random forests, but I kind of like libSVM so I used that. Is it OK to use SVM with BOW? I think it's fine, I'm just clarifying. Sorry to ask so many questions, but you have such good explanations that I can't help asking. :P – rish Aug 01 '13 at 11:53
  • @rish Thanks for the kind words. Regarding using SVM for BOW: sure. BOW is just a way to represent your data; after that you can use any classifier to classify it, be it SVM, Random Forest, etc. Regarding MKL, I suggest first understanding how SVMs work, specifically the use of kernel matrices in SVMs. The (i, j) entry of the kernel matrix essentially stores a measure of similarity between your ith and jth samples. MKL basically figures out how to optimally combine multiple kernel matrices (one created for each feature) and uses the result in the SVM. Check out the Shogun Machine Learning Library for details. – lightalchemist Aug 01 '13 at 13:01
  • Thanks a lot. All is working perfectly for me, so excited. I think this is one last question. I am using SIFT for feature extraction, but I tried to increase the number of keypoints using Harris-Affine keypoints, and it works well and improves my results. But my friend told me it cannot go with SIFT, so I came here to ask you :P, since it's the same problem I discussed. And yes, BOW makes the work so easy, so thanks so much. I did some research and tried SIFT + Harris-Affine keypoints, and I think it is OK. Could you confirm this last question of mine? Thanks so much for your great help. – rish Aug 04 '13 at 14:05
  • Firstly, there is no part of SIFT's *descriptor* formulation that explicitly requires it to be used with its own *keypoint detector*. The image retrieval literature is rife with research on finding good keypoints to extract descriptors (not necessarily SIFT) from, e.g. the MSER detector (see http://www.robots.ox.ac.uk/~vgg/research/affine/). Having said that, the choice of a good keypoint detector is crucial to attaining good results. However, depending on your dataset, some keypoint detectors with certain parameter settings may *fail to detect* keypoints, causing the BOW model to perform poorly. – lightalchemist Aug 04 '13 at 14:41
  • (cont.) Hence, even if a standard Harris corner detector may not be as good as SIFT or MSER at detecting "stable" keypoints, its default settings may allow it to detect keypoints in images where your *parameter settings* for SIFT cannot (e.g. low-res or overly smooth images). So it may not be that Harris is better, just that its settings are set more appropriately for your images. In general, SIFT and MSER are more discriminative about which keypoints they find than Harris. Debug by drawing the detected SIFT keypoints on your images to see if *too few* keypoints are detected. – lightalchemist Aug 04 '13 at 14:45