4

I am working with a classification algorithm that requires the feature vectors of all training and testing samples to be the same size.

I am also required to use the SIFT feature extractor. This is causing problems because the feature matrix of every image comes out a different size. I know that SIFT detects a variable number of keypoints in each image, but is there a way to make the size of the SIFT features consistent so that I do not get a dimension mismatch error?

I have tried rootSIFT as a workaround:

[~, features] = vl_sift(single(images{i}));
double_features = double(features);
root_it = sqrt( double_features/sum(double_features) ); % root-sift
feats{i} = root_it;

This gives me a consistent 128 x 1 vector for every image, but it is not working for me: each feature vector is now very small, and I am getting a lot of NaN values in my classification results.

Is there any way to solve this?

StuckInPhDNoMore
  • AFAIK each SIFT feature ("point") is described by 128 values. I don't know how the code you are using works, but there must be a way of getting the descriptors for all of the points. – Ander Biguri Feb 20 '15 at 15:07
  • Actually, looking at the documentation of `vl_sift`, the first output should be the points and the second one (`features` in your code) should be the descriptors, of size 128 x n, right? – Ander Biguri Feb 20 '15 at 15:10
  • Yes, you are correct. The 128 (which is the size of each descriptor) remains the same, but the number of descriptors is variable for each image; this is what I want to find a solution to. – StuckInPhDNoMore Feb 20 '15 at 15:19

2 Answers

3

Using SIFT, there are two steps you need to perform in general.

  1. Extract SIFT features. These keypoints (the first output argument of your function, which holds the location, scale and orientation of each detected point) are scale invariant and should, in theory, be present in every image of the same object. In practice this is not completely true: often points are unique to each frame (image). Each point is described by a 128-value descriptor (the second output argument of your function).
  2. Match points. Each time you compute features on a different image, the number of points computed is different! Many of them should correspond to the same points as in the previous image, but many of them WON'T: you will have new points, and old points may no longer be present. This is why you should perform a feature matching step to link those points across images. Usually this is done with kNN matching or RANSAC. You can Google how to perform this task and you'll find tons of examples.

After the second step, you should have a fixed number of points for the whole set of images (assuming they are images of the same object). The number of points will be significantly smaller than in each single image (sometimes ~30 times fewer points). Then do whatever you want with them!

Hint for matching: http://www.vlfeat.org/matlab/vl_ubcmatch.html

UPDATE:

You seem to be trying to train some kind of OCR system. You would probably need to match SIFT features independently for each character.

How to use vl_ubcmatch:

% vl_sift expects a single-precision grayscale image
[~, features1] = vl_sift(single(I1));
[~, features2] = vl_sift(single(I2));

matches = vl_ubcmatch(features1, features2);
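
As a rough sketch of what to do with the result (assuming `features1` and `features2` are the 128 x N descriptor matrices returned above), `matches` is a 2 x M matrix of column indices, so the matched descriptors of the two images can be pulled out so that they line up column by column:

% matches(1,k) indexes a column of features1 and matches(2,k) the
% corresponding column of features2, giving M matched descriptor pairs
matched1 = features1(:, matches(1, :));   % 128 x M
matched2 = features2(:, matches(2, :));   % 128 x M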
Ander Biguri
  • Thank you, that does make more sense as to why I usually get a varying number of keypoints at different runs. The matching you explained looks like what I might be looking for. First off, the images are not of the same object, rather they are different text samples from the same writer, using different words. Also, the hint you linked does not have much explanation: what are `descriptor1` and `descriptor2`? Are they the descriptors from two runs of SIFT detection on the same image? Or are they the descriptors of the train and test images respectively of my classifier? – StuckInPhDNoMore Feb 20 '15 at 16:43
2

You can apply dense SIFT to the image. This way you have more control over where the feature descriptors come from. I haven't used VLFeat, but looking at the documentation I see there's a function to extract dense SIFT features called vl_dsift. With vl_sift, I see there's a way to bypass the detector and extract the descriptors from points of your choice using the 'frames' option. Either way it seems you can get a fixed number of descriptors.
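
For example, a minimal sketch of both options (assuming a single-precision grayscale image `I` and VLFeat on the path; the file name, grid step, and frame values below are just placeholders):

I = single(rgb2gray(imread('sample.png')));   % hypothetical input image

% Dense SIFT: descriptors are computed on a regular grid, so the number of
% 128-dimensional descriptor columns depends only on the image size and step
[frames, descriptors] = vl_dsift(I, 'Step', 8, 'Size', 4);

% Alternative: bypass the detector and describe fixed locations with the
% 'Frames' option of vl_sift (each frame is [x; y; scale; orientation])
fc = [64; 64; 2; 0];
[~, d] = vl_sift(I, 'Frames', fc);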

If you are using images of the same size, dense SIFT or the frames option is okay. There's another approach you can take, called the bag-of-features model (similar to the bag-of-words model), in which you cluster the features extracted from all images to generate codewords, then represent each image as a fixed-length histogram of codeword occurrences and feed those histograms into a classifier.
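
A minimal sketch of that idea (assuming `feats` is a cell array of 128 x N descriptor matrices, one per image, that `kmeans`/`knnsearch` from the Statistics Toolbox are available, and that the vocabulary size `K` is an arbitrary choice):

K = 100;                                    % number of codewords (arbitrary)
all_descr = double(cat(2, feats{:}))';      % stack all descriptors, one per row

% Cluster the descriptors into a visual vocabulary of K codewords
[~, vocab] = kmeans(all_descr, K, 'MaxIter', 200);

% Encode every image as a fixed-length K-bin histogram of codeword counts
hists = zeros(numel(feats), K);
for i = 1:numel(feats)
    d = double(feats{i})';                        % N x 128 descriptors
    idx = knnsearch(vocab, d);                    % nearest codeword per descriptor
    h = histcounts(idx, 0.5:1:K + 0.5);           % K-bin histogram
    hists(i, :) = h / max(sum(h), 1);             % normalize to sum 1
end

Each row of hists is then a length-K feature vector, regardless of how many SIFT keypoints each image produced.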

dhanushka