
I want to find occurrences of a particular word in any web page given as input. I used a pyramid sliding-window approach and generated HOG (Histogram of Oriented Gradients) features for every sliding window.

For now, I am comparing the HOG features of each window with the HOG features of the word I want to extract.

To compare two HOG feature vectors, I simply take the sum of (vector1(i) - vector2(i)) over all i.
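One thing worth checking with the comparison described above: a signed sum lets positive and negative differences cancel, so two quite different vectors can score as identical. Here is a minimal sketch contrasting it with Euclidean distance and cosine similarity. The function names and the toy vectors are made up for illustration; they are not real HOG output, and this is not a claim about which measure works best on this data.

```python
import math

def signed_sum(a, b):
    """The comparison described in the question: sum of (a[i] - b[i])."""
    return sum(x - y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """L2 distance: differences are squared, so they cannot cancel."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Toy stand-ins for two HOG vectors (hypothetical values).
template = [0.1, 0.4, 0.3, 0.2]
window = [0.4, 0.1, 0.2, 0.3]

print("signed sum:", signed_sum(template, window))
print("euclidean :", euclidean_distance(template, window))
print("cosine    :", cosine_similarity(template, window))
```

The two toy vectors differ in every component, yet the signed sum comes out at (approximately) zero, while the Euclidean distance is clearly non-zero.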

However, the results are below expectations.

My question is: is there a better way to compare the HOG features of each window with those of the word I want to find? Or should I train a classifier such as an SVM to classify the HOG features of a window?

For training the classifier, I can have at most 100-200 examples of the word I want to find in my data set. Since it is better for an SVM to have an equal number of positive and negative examples in the data set, how do I restrict the non-word examples (negatives) to 100-200? For the non-word examples in the training set, I have:

1. ICDAR-2003 (this word data set does not contain the word I want to extract)

2. CIFAR image data set
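If the goal is simply to cap the negatives at roughly the number of positives, one straightforward option is to draw a random subset of the negative pool. A minimal sketch follows, assuming the negative patches have already been collected into a list. `balance_negatives`, `ratio`, and the string placeholders are hypothetical names for illustration, and whether a balanced set is actually the best choice here is the premise of the question, not something this code establishes.

```python
import random

def balance_negatives(positives, negatives, ratio=1.0, seed=0):
    """Randomly pick about ratio * len(positives) items from negatives.

    Sampling is without replacement, so the result never exceeds
    the size of the negative pool.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    target = min(len(negatives), int(len(positives) * ratio))
    return rng.sample(negatives, target)

# Placeholder items standing in for image patches or feature vectors.
positives = [f"word_{i}" for i in range(200)]
negatives = [f"icdar_{i}" for i in range(1000)] + [f"cifar_{i}" for i in range(5000)]

chosen = balance_negatives(positives, negatives)
print(len(positives), len(chosen))
```

Because the draw is uniform over the combined pool, the larger source dominates the subset; sampling from each source separately would be a small change if an even mix is wanted.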

The reason I am not searching for this word in the HTML code is that the word can also occur in an image.

Moreover, since the word I want to find is fixed, how many images of the word should I have in the data set?

user8788828

1 Answer


If you have a fixed font and are looking only for a particular word, here is a simple workaround:

https://stackoverflow.com/a/9647509/8682088

You have to extract the word box and resize it to, for example, 40x10 pixels. The grayscale pixel values could be your feature vector. Then you could train your SVM. It is primitive, but surprisingly effective.

It works perfectly well with a fixed font and simple symbols.
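A rough sketch of the feature-extraction step described above, in plain Python so it runs without any image library. It assumes the word box is already cropped and given as a 2-D list of grayscale values in 0..255; `resize_nearest` and `pixel_features` are made-up helper names, and nearest-neighbour resizing is just the simplest choice here (a real pipeline would more likely use an image library's resize).

```python
def resize_nearest(gray, out_w=40, out_h=10):
    """Nearest-neighbour resize of a 2-D list of grayscale values."""
    in_h, in_w = len(gray), len(gray[0])
    return [
        [gray[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def pixel_features(gray, out_w=40, out_h=10):
    """Resize the box, flatten it row by row, and scale values to [0, 1]."""
    small = resize_nearest(gray, out_w, out_h)
    return [value / 255.0 for row in small for value in row]

# Toy 20x80 "word box": dark left half, light right half.
box = [[0] * 40 + [255] * 40 for _ in range(20)]
features = pixel_features(box)
print(len(features))  # 40 * 10 values per box
```

Each box then becomes one fixed-length vector, and a list of such vectors with 0/1 labels is the kind of input an SVM implementation (for example `sklearn.svm.SVC`) expects for training.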

Kamil Szelag
  • Hey! That's the problem. I want to train an SVM, but I am confused about the data set. The positive training examples can be multiple renderings of the word I want to search for (though they will be almost identical). However, the negative training examples can be anything: colored or white backgrounds, image components, other words, etc. So while I can have a limited amount of positive data (say 200-300 examples/images), how should I choose my negative data (data that is not the word I want to search for)? – user8788828 Oct 17 '17 at 16:53