I'm using OpenCV and trying to detect whether a scanned page of a book has already been scanned. I already looked at this post, but it didn't help me enough.

Currently I'm computing a 1:N SURF matching between the input image and all the other pages I've scanned so far.

This method works pretty well. Even when I take just a 192x192 square containing text rather than the whole image, it is able to distinguish the pages.

I'd like to know whether you think there is a faster method than this one. I thought about LSH: I would only have to extract the features from the input image, hash them in some way, and then check whether they land in a bucket that is already in use.
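To make the idea concrete, here is a rough sketch of the kind of hashing I have in mind. It assumes random-hyperplane (sign) hashing, which is only one possible LSH family and not necessarily the right one for SURF. The names `PageIndex`, `lsh_key` and `make_hyperplanes`, the 16-bit key length, and the use of plain Python lists in place of real descriptors are all made up for illustration.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes of dimension dim (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(descriptor, hyperplanes):
    """Bucket key: one sign bit per hyperplane, packed into an int."""
    key = 0
    for plane in hyperplanes:
        dot = sum(p * d for p, d in zip(plane, descriptor))
        key = (key << 1) | (1 if dot >= 0 else 0)
    return key

class PageIndex:
    """Maps bucket keys to the ids of the pages that put a descriptor there."""

    def __init__(self, dim, n_bits=16, seed=0):
        self.planes = make_hyperplanes(n_bits, dim, seed)
        self.buckets = {}  # bucket key -> set of page ids

    def add_page(self, page_id, descriptors):
        for d in descriptors:
            self.buckets.setdefault(lsh_key(d, self.planes), set()).add(page_id)

    def vote(self, descriptors):
        """Count, per stored page, how many query descriptors share a bucket with it."""
        votes = {}
        for d in descriptors:
            for page_id in self.buckets.get(lsh_key(d, self.planes), ()):
                votes[page_id] = votes.get(page_id, 0) + 1
        return votes
```

A page that collects votes from a large share of the query descriptors would then be a candidate duplicate, to be confirmed with the full SURF match. The vote threshold and the number of bits would need tuning on real scans.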

So basically my question is: do you think the method I described above could work? And if so, how should I design the hash function?

Thanks, .A

Nazgul

1 Answer

My first thought would be a first pass that throws away impossible matches quickly and cheaply.

So something that simply computed an image histogram, of either the whole image or a set of windows, would let you discriminate half-empty pages from full pages before doing a more expensive test.
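As a minimal sketch of such a first pass, assuming 8-bit grayscale pixels given as a flat sequence of integers: the function names, the 16 bins, and the 0.2 distance cutoff are illustrative choices, not tuned values.

```python
def gray_histogram(pixels, n_bins=16):
    """Normalized histogram of 8-bit gray values (flat iterable of ints in 0..255)."""
    counts = [0] * n_bins
    total = 0
    for value in pixels:
        counts[value * n_bins // 256] += 1
        total += 1
    return [c / total for c in counts] if total else counts

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms, in the range [0, 2]."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def candidate_pages(query_hist, stored_hists, max_distance=0.2):
    """Keep only the stored pages whose histogram is close to the query's."""
    return [page_id for page_id, hist in stored_hists.items()
            if histogram_distance(query_hist, hist) <= max_distance]
```

Only the pages returned by `candidate_pages` would go on to the expensive feature match. Computing the histogram per window rather than per page would follow the same pattern, with one histogram per window.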

Martin Beckett
  • That would help, yes, but in my case I'm mainly focusing on normal books and black-and-white papers, so a histogram wouldn't be so useful. – Nazgul Feb 15 '11 at 21:39
  • If the scans are reasonably orientated and at the same scale, a simple 1D slice through selected lines of text, followed by a 'tree ring' type match of edges, might still be a good start. – Martin Beckett Feb 15 '11 at 21:58
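
One possible reading of the 'tree ring' suggestion, as a sketch only: take a single row of gray pixels through a line of text, record where it switches between ink and paper, and compare those edge positions between two scans with a small pixel tolerance. The function names, the fixed threshold of 128, and the tolerance of 2 pixels are assumptions for illustration; it also assumes the scans are already aligned and at the same scale, as the comment requires.

```python
from bisect import bisect_left

def edge_positions(scanline, threshold=128):
    """Indices where the scanline crosses the threshold (ink <-> paper)."""
    edges = []
    for i in range(1, len(scanline)):
        if (scanline[i - 1] < threshold) != (scanline[i] < threshold):
            edges.append(i)
    return edges

def edge_match_score(query_edges, stored_edges, tolerance=2):
    """Fraction of query edges that have a stored edge within `tolerance` pixels."""
    if not query_edges:
        return 0.0
    hits = 0
    for e in query_edges:
        # stored_edges is sorted, so only the neighbours around e need checking
        j = bisect_left(stored_edges, e)
        neighbours = stored_edges[max(j - 1, 0):j + 1]
        if any(abs(e - s) <= tolerance for s in neighbours):
            hits += 1
    return hits / len(query_edges)
```

A score near 1.0 on a few selected scanlines would mark the stored page as worth a full comparison, while a low score would rule it out cheaply.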