I am trying to implement Bag of Words in OpenCV and have come up with the implementation below. I am using the Caltech 101 database. However, since this is my first time and I am not yet familiar with it, I plan to use two image sets from the database: the chair image set and the soccer ball image set. I have coded the SVM using this.

Everything went all right, except that when I call classifier.predict(descriptor), I do not get the label value I expect. I always get a 0 instead of a 1, irrespective of my test image. There are 10 images in the chair dataset and 10 in the soccer ball dataset. I labelled chair as 0 and soccer ball as 1. The links are the samples from each category: the top 10 are chairs and the bottom 10 are soccer balls.

function hello

    clear all; close all; clc;
        
    detector = cv.FeatureDetector('SURF');
    extractor = cv.DescriptorExtractor('SURF');
        
        
    links = {
    'http://i.imgur.com/48nMezh.jpg'
    'http://i.imgur.com/RrZ1i52.jpg'
    'http://i.imgur.com/ZI0N3vr.jpg'
    'http://i.imgur.com/b6lY0bJ.jpg'
    'http://i.imgur.com/Vs4TYPm.jpg'
    'http://i.imgur.com/GtcwRWY.jpg'
    'http://i.imgur.com/BGW1rqS.jpg'
    'http://i.imgur.com/jI9UFn8.jpg'
    'http://i.imgur.com/W1afQ2O.jpg'
    'http://i.imgur.com/PyX3adM.jpg'


    'http://i.imgur.com/U2g4kW5.jpg'
    'http://i.imgur.com/M8ZMBJ4.jpg'
    'http://i.imgur.com/CinqIWI.jpg'
    'http://i.imgur.com/QtgsblB.jpg'
    'http://i.imgur.com/SZX13Im.jpg'
    'http://i.imgur.com/7zVErXU.jpg'
    'http://i.imgur.com/uUMGw9i.jpg'
    'http://i.imgur.com/qYSkqEg.jpg'
    'http://i.imgur.com/sAj3pib.jpg'
    'http://i.imgur.com/DMPsKfo.jpg'
    };
 
       
    N = numel(links);
        
    trainer = cv.BOWKMeansTrainer(100);
          
        
    train = struct('val',repmat({' '},N,1),'img',cell(N,1), 'pts',cell(N,1), 'feat',cell(N,1));
        
            
    for i=1:N
            
      train(i).val = links{i};
      train(i).img = imread(links{i});
        
       if ndims(train(i).img) > 2   % convert only colour (3-D) images to grayscale
         train(i).img = rgb2gray(train(i).img);
       end;
                
       train(i).pts = detector.detect(train(i).img);
       train(i).feat = extractor.compute(train(i).img,train(i).pts);
            
     end;
        
     for i=1:N
          trainer.add(train(i).feat);
     end;
         
     dictionary = trainer.cluster();
     extractor = cv.BOWImgDescriptorExtractor('SURF','BruteForce');
     extractor.setVocabulary(dictionary);
        
     for i=1:N
          desc(i,:) = extractor.compute(train(i).img,train(i).pts);
     end;
        
     a = zeros(1,10)';
     b = ones(1,10)';
     labels = [a;b];
       
           
     classifier  = cv.SVM;
     classifier.train(desc,labels);
     
     test_im =rgb2gray(imread('D:\ball1.jpg'));
        
     test_pts = detector.detect(test_im);
     test_feat = extractor.compute(test_im,test_pts);
           
     val = classifier.predict(test_feat);
     disp('Value is: ')
     disp(val)
        
     end

These are my test samples:

[Image: Soccer Ball (source: timeslive.co.za)]

[Image: Chair]

After searching through this site, I think my algorithm is okay, although I am not fully confident about it. If anybody can help me find the bug, I would appreciate it.

Following Amro's code, this was my result:

Distribution of classes:
  Value    Count   Percent
      1       62     49.21%
      2       64     50.79%
Number of training instances = 61
Number of testing instances = 65
Number of keypoints detected = 38845
Codebook size = 100
SVM model parameters:
         svm_type: 'C_SVC'
      kernel_type: 'RBF'
           degree: 0
            gamma: 0.5063
            coef0: 0
                C: 62.5000
               nu: 0
                p: 0
    class_weights: 0
        term_crit: [1x1 struct]

Confusion matrix:

ans =

    29     1
     1    34

Accuracy = 96.92 %
Glorfindel
motiur

2 Answers

2

Your logic looks fine to me.

Now I guess you'll have to tweak the various parameters if you want to improve the classification accuracy. This includes the clustering algorithm parameters (such as the vocabulary size, cluster initialization, termination criteria, etc.), the SVM parameters (kernel type, the C coefficient, ...), and the local features algorithm used (SIFT, SURF, ...).

Ideally, whenever you want to perform parameter selection, you ought to use cross-validation. Some methods already have such a mechanism embedded (CvSVM::train_auto for instance), but for the most part you'll have to do this manually...
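
As a rough illustration of doing cross-validation manually, here is a minimal sketch in plain MATLAB (no toolboxes). The data is synthetic, and the nearest-class-mean "classifier" is only a stand-in I made up so the snippet runs on its own; in practice you would replace the train and predict steps with your real model (e.g. cv.SVM).

```matlab
% Synthetic, well-separated two-class data standing in for BoW histograms.
X = [0.9 0.1; 0.8 0.2; 0.7 0.3; 0.9 0.2; 0.8 0.1; 0.7 0.2; ...
     0.1 0.9; 0.2 0.8; 0.3 0.7; 0.2 0.9; 0.1 0.8; 0.2 0.7];
y = [1 1 1 1 1 1 2 2 2 2 2 2]';
k = 3;                                   % number of folds

% Stratified fold assignment: shuffle within each class, then deal out folds.
foldId = zeros(size(y));
classes = unique(y);
for c = classes'
    idx = find(y == c);
    idx = idx(randperm(numel(idx)));
    foldId(idx) = mod(0:numel(idx)-1, k)' + 1;
end

acc = zeros(k, 1);
for f = 1:k
    testMask = (foldId == f);
    Xtr = X(~testMask, :);  ytr = y(~testMask);
    Xte = X(testMask, :);   yte = y(testMask);

    % "Train": one mean vector per class (placeholder for the real classifier).
    mu = zeros(numel(classes), size(X, 2));
    for j = 1:numel(classes)
        mu(j, :) = mean(Xtr(ytr == classes(j), :), 1);
    end

    % "Predict": label of the nearest class mean (squared Euclidean distance).
    pred = zeros(size(yte));
    for i = 1:size(Xte, 1)
        d = sum((mu - repmat(Xte(i, :), numel(classes), 1)).^2, 2);
        [~, j] = min(d);
        pred(i) = classes(j);
    end
    acc(f) = mean(pred == yte);
end

cvAccuracy = mean(acc);
fprintf('Mean %d-fold accuracy = %.2f %%\n', k, 100 * cvAccuracy);
```

To select a parameter (say the vocabulary size or the SVM C value), you would repeat this loop once per candidate value and keep the value with the best mean accuracy.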

Finally, you should follow general machine learning guidelines; see the whole bias-variance tradeoff dilemma. The online Coursera ML class discusses this topic in detail in week 6, and explains how to perform error analysis and use learning curves to decide what to try next (do we need to add more instances, increase model complexity, and so on).

With that said, I wrote my own version of the code. You might wanna compare it with your code:

% dataset of images
% I previously saved them as: chair1.jpg, ..., ball1.jpg, ball2.jpg, ...
d = [
    dir(fullfile('images','chair*.jpg')) ;
    dir(fullfile('images','ball*.jpg'))
];

% local-features algorithm used
detector = cv.FeatureDetector('SURF');
extractor = cv.DescriptorExtractor('SURF');

% extract local features from images
t = struct();
for i=1:numel(d)
    % load image as grayscale
    img = imread(fullfile('images', d(i).name));
    if ~ismatrix(img), img = rgb2gray(img); end

    % extract local features
    pts = detector.detect(img);
    feat = extractor.compute(img, pts);

    % store along with class label
    t(i).img = img;
    t(i).class = find(strncmp(d(i).name,{'chair','ball'},4));
    t(i).pts = pts;
    t(i).feat = feat;
end

% split into training/testing sets
% (a better way would be to use cvpartition from Statistics toolbox)
disp('Distribution of classes:')
tabulate([t.class])
tTrain = t([1:7 11:17]);
tTest = t([8:10 18:20]);
fprintf('Number of training instances = %d\n', numel(tTrain));
fprintf('Number of testing instances = %d\n', numel(tTest));

% build visual vocabulary (by clustering training descriptors)
K = 100;
bowTrainer = cv.BOWKMeansTrainer(K, 'Attempts',5, 'Initialization','PP');
clust = bowTrainer.cluster(vertcat(tTrain.feat));

fprintf('Number of keypoints detected = %d\n', numel([tTrain.pts]));
fprintf('Codebook size = %d\n', K);

% compute histograms of visual words for each training image
bowExtractor = cv.BOWImgDescriptorExtractor('SURF', 'BruteForce');
bowExtractor.setVocabulary(clust);
M = zeros(numel(tTrain), K);
for i=1:numel(tTrain)
    M(i,:) = bowExtractor.compute(tTrain(i).img, tTrain(i).pts);
end
labels = vertcat(tTrain.class);

% train an SVM model (perform parameter selection using cross-validation)
svm = cv.SVM();
svm.train_auto(M, labels, 'SvmType','C_SVC', 'KernelType','RBF');
disp('SVM model parameters:'); disp(svm.Params)

% evaluate classifier using testing images
actual = vertcat(tTest.class);
pred = zeros(size(actual));
for i=1:numel(tTest)
    descs = bowExtractor.compute(tTest(i).img, tTest(i).pts);
    pred(i) = svm.predict(descs);
end

% report performance
disp('Confusion matrix:')
confusionmat(actual, pred)
fprintf('Accuracy = %.2f %%\n', 100*nnz(pred==actual)./numel(pred));

Here is the output:

Distribution of classes:
  Value    Count   Percent
      1       10     50.00%
      2       10     50.00%
Number of training instances = 14
Number of testing instances = 6

Number of keypoints detected = 6300
Codebook size = 100

SVM model parameters:
         svm_type: 'C_SVC'
      kernel_type: 'RBF'
           degree: 0
            gamma: 0.5063
            coef0: 0
                C: 312.5000
               nu: 0
                p: 0
    class_weights: []
        term_crit: [1x1 struct]

Confusion matrix:
ans =
     3     0
     1     2
Accuracy = 83.33 %

So the classifier correctly labels 5 out of 6 images from the test set, which is not bad for a start :) Obviously you'll get different results each time you run the code due to the inherent randomness of the clustering step.
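
For reference, the confusion matrix and accuracy above can also be computed without the Statistics toolbox (which `confusionmat` needs), using only core MATLAB. This is a small sketch; the `pred` vector below is made up so that it reproduces the matrix shown, since the actual per-image predictions are not listed.

```matlab
% True labels of the 6 test images (3 chairs = class 1, 3 balls = class 2).
actual = [1 1 1 2 2 2]';
% Illustrative predictions only: one ball is mislabelled as a chair.
pred   = [1 1 1 1 2 2]';

% Rows = actual class, columns = predicted class (same layout as confusionmat).
numClasses = 2;
C = accumarray([actual pred], 1, [numClasses numClasses]);
disp(C)

% Accuracy = correctly classified (diagonal) / total.
accuracy = 100 * trace(C) / sum(C(:));
fprintf('Accuracy = %.2f %%\n', accuracy);
```

The diagonal holds the correct predictions (3 + 2 = 5), so the accuracy is 5/6.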

Amro
  • Weird, thanks... I don't know how you came to know that much; really appreciated. This is really off-topic: did all of your expertise come from college, from work, or both? Is there anything you would suggest to improve myself in CS in general? – motiur Dec 28 '13 at 15:40
  • @whoknows: you'd be surprised how much you can learn by hanging out here on Stack Overflow :) – Amro Dec 28 '13 at 16:07
  • I am really surprised with the result. – motiur Dec 29 '13 at 07:06
  • @whoknows: 97% accuracy is indeed a good result! I'm thinking it was the SVM "auto training" that did it for you (i.e. performing cross-validation to select the SVM parameters)... – Amro Dec 29 '13 at 07:42
  • Just another random thought: www.kaggle.com has a lot of machine learning datasets to train on; any idea how robust these methods would be in those cases? Just a guess. I am aware of the heavy pre-processing step. – motiur Dec 29 '13 at 08:02
  • @whoknows: there is no reason why it shouldn't work, although I wouldn't expect to win any competition with just that :) As you said, those ML challenges usually require putting more thought into the pre-processing of the data, and extracting better features. Also the winning teams often employ multiple different algorithms combined using [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) techniques to get even better predictions... – Amro Dec 29 '13 at 08:21
  • Just a remark: I have used 'ORB' and 'BruteForce-Hamming' here, bowExtractor = cv.BOWImgDescriptorExtractor('ORB', 'BruteForce-Hamming'), and also changed the places where 'SURF' was used. I am getting an error like this: http://pastebin.com/raw.php?i=8ydddtcA . Does BOWImgDescriptorExtractor() not support anything except SIFT and SURF? SIFT and SURF both work fine. – motiur Jan 25 '14 at 14:50
  • @whoknows: I haven't looked into this yet, but a quick search indicates that there is a potential issue with the BOW+ORB combination (really anything but SIFT or SURF like you said): http://answers.opencv.org/question/17460/how-to-use-bag-of-words-example-with-brief/, http://answers.opencv.org/question/24835/is-there-a-way-of-using-orb-with-bow/ ... – Amro Jan 25 '14 at 15:20
  • Well, this raises the question of whether mexopencv has any cv::convert equivalent. I saw cv.cvtColor. Do you know of any API for the conversion to CV32F? – motiur Jan 25 '14 at 16:05
  • @whoknows: you don't need an OpenCV function for that; just do the cast in regular MATLAB (simply do `feat = single(feat);` whenever you compute a feature vector). Unfortunately, a quick test shows that this still doesn't solve the problem. I'm guessing that `BOWImgDescriptorExtractor` internally passes the uint8-typed ORB features directly to its matcher distance function (which, from what I can tell, only works on floating-point values), not giving you the chance to cast them first... Ultimately this is an OpenCV issue, and mexopencv can do nothing about it. Consider filing a bug with the OpenCV devs. – Amro Jan 25 '14 at 16:46
  • One of the suggestions was to implement your own `BOWImgDescriptorExtractor`: you have the computed clusters, and you have the descriptor extractor function, so all you need is to extract the feature set and assign each feature vector to the closest cluster centroid (using the desired distance function). For each image, you manually count how many times each of the clusters got assigned a vector, then build a histogram. This will be the output of `BOWImgDescriptorExtractor`. If you wish to do it, you may find the `pdist2` or `knnsearch` functions from the Statistics toolbox useful... – Amro Jan 25 '14 at 17:02
  • Life is not easy, then... ha! I will come back to it in a few days. Thanks, have a nice day. – motiur Jan 25 '14 at 17:06
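
The manual replacement for `BOWImgDescriptorExtractor` described in the comments above could look roughly like the sketch below. It uses only core MATLAB (no `pdist2`), Euclidean distance, and toy values for the vocabulary and descriptors that are made up for illustration. Normalising the histogram to sum to 1 is my assumption about the desired output, and binary descriptors such as ORB would need a Hamming-style distance instead.

```matlab
% vocab: K-by-D cluster centres; feat: N-by-D descriptors of one image.
% Toy values, invented purely for illustration.
vocab = [0 0; 10 0; 0 10];               % K = 3 visual words, D = 2
feat  = [1 1; 9 1; 0 9; 1 0; 11 -1];     % N = 5 descriptors

vocab = single(vocab);
feat  = single(feat);
K = size(vocab, 1);

% Squared Euclidean distance between every descriptor and every centre,
% via ||a-b||^2 = ||a||^2 - 2*a*b' + ||b||^2, giving an N-by-K matrix.
D2 = bsxfun(@plus, sum(feat.^2, 2), sum(vocab.^2, 2)') - 2 * (feat * vocab');

% Index of the nearest visual word for each descriptor.
[~, word] = min(D2, [], 2);

% Count assignments per word, then normalise so the histogram sums to 1.
h = accumarray(word, 1, [K 1])';
h = h / sum(h);
disp(h)
```

The resulting 1-by-K row vector `h` plays the role of the per-image descriptor that would be fed to the classifier.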
0

How many images are you using to build your dictionary, i.e. what is N? From your code, it seems that you are using only 10 images (those listed in links). I hope this list was truncated for this post, because otherwise that would be too few. Typically you need on the order of 1000 images or many more to build the dictionary, and the images need not be restricted to the 2 classes that you are classifying. Otherwise, with only 10 images and 100 clusters, your dictionary is likely to be messed up.

Also, you might want to use SIFT as a first choice as it tends to perform better than the other descriptors.

Lastly, you can also debug by checking the detected keypoints. You can get OpenCV to draw the keypoints. Sometimes your keypoint detector parameters are not set properly, resulting in too few keypoints getting detected, which in turn gives poor feature vectors.
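
A quick way to spot images with too few keypoints is to count them per image. The sketch below fakes a small `train` struct array (the file names, counts, and the threshold of 50 are all invented for illustration) so that it runs without mexopencv; with the real data you would use the struct array filled by `detector.detect`.

```matlab
% Hypothetical stand-in for the struct array from the question: each element
% holds the keypoints of one image (dummy keypoint structs here).
train = struct('val', {'chair1.jpg', 'chair2.jpg', 'ball1.jpg'}, ...
               'pts', {repmat(struct('pt', [0 0]), 1, 350), ...
                       repmat(struct('pt', [0 0]), 1, 12), ...
                       repmat(struct('pt', [0 0]), 1, 420)});

minPts = 50;    % arbitrary threshold, chosen only for illustration
counts = arrayfun(@(s) numel(s.pts), train);

% Report the images whose keypoint count falls below the threshold.
for i = find(counts < minPts)
    fprintf('%s: only %d keypoints detected\n', train(i).val, counts(i));
end

% With mexopencv, flagged images could then be inspected visually; I believe
% cv.drawKeypoints is available for this, but treat that as an assumption:
%   imshow(cv.drawKeypoints(train(i).img, train(i).pts))
```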

To understand more about the BOW algorithm, you can take a look at these posts here and here. The second post has a link to a free PDF of an O'Reilly book on computer vision using Python. The BOW model (and other useful stuff) is described in more detail in that book.

Hope this helps.

Community
lightalchemist
  • Is my algorithm okay? I am a bit fuzzy on the general algorithm used for bag of words. Also, how many classes should I use? It takes long, so an approximate guess will do. Should I include all of the training samples? – motiur Dec 26 '13 at 10:41
  • @whoknows I added links to some answers that I gave in the past on the BOW model. You might find the explanation therein useful for understanding how it works. Also, one of the posts contains a link to a useful book (with code in Python) that contains details on how to code the BOW model. On a separate note, if what you posted is exactly what you are running, then it looks wrong. For one, you need to extract descriptors from 1000s of images to build your dictionary. The size of the dictionary is usually decided through cross-validation (i.e. trial and error + plotting the results). – lightalchemist Dec 26 '13 at 13:41