I want to find the occurrence of a particular word in any webpage given as a input. I used Pyramid-Sliding window , where I generated HOG(Histogram of Gradients) features for all the sliding windows.
For now , I am comparing the HOG features of all windows with the HOG features of the word I want to extract.
For comparison of the two HOG feature vectors, I am just taking summation(vector1(i) - vector2(i)) for all i.
However, the results are below expectations.
My query is that can there be a better comparison system for comparing the HOG-features of each window with that of the word I want to find. Or should I train a classifier like SVM , to classify the HOG-features of a window.
For training the classifier, I can have max 100-200 elements for the word I want to find in my data-set. And since for SVM , its better to have equal number of true and false data elements in the data-set , how to restrict the non word representations(false elements) to 100-200. For non-word data elements in the training set, I have :
1. ICDAR-2003 (this word data-set do not contain the word I want to extract)
2. CIFAR image data set
The reason I am not extracting/finding this word in the html code, is because the word can occur in an image also.
Moreover, since the word I want to find is fixed, how many images of the word should I have in the data-set.