I have come across a captcha-protected website, and I would like to get rid of the captcha. Here are some sample images:

Captcha1

Captcha2

Since the background is static and the text consists of computer-generated, undistorted characters, I believe this is very doable. Passing the image directly to Tesseract (an OCR engine) does not give a usable result, so I would like to remove the captcha background before running OCR.

I tried several background removal methods using Python-PIL:

  1. Remove all non-black pixels. This removes the lines, but it does not remove the small solid black box.
  2. Apply a filter mentioned in another StackOverflow post. This also does not remove the small solid black box, and it is less effective than method 1.

Methods 1 and 2 give me an image like this:

enter image description here

It seems close, but Tesseract could not recognize the characters, even after the top and bottom rows of dots were removed.
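A minimal sketch of method 1, assuming the captcha has already been loaded as rows of grayscale values (0 = black, 255 = white). The function name `keep_dark_pixels` and the cutoff of 50 are illustrative assumptions, not values taken from the actual captcha.

```python
def keep_dark_pixels(rows, threshold=50):
    """Return a copy where near-black pixels stay black and everything else turns white.

    `rows` is a list of rows of grayscale values (0..255).
    `threshold` is an assumed cutoff; tune it against real samples.
    """
    return [[0 if p < threshold else 255 for p in row] for row in rows]


# With PIL, roughly the same effect can be had directly on an image object:
#   img.convert("L").point(lambda p: 0 if p < 50 else 255)
sample = [[0, 120, 255],
          [30, 49, 50]]
cleaned = keep_dark_pixels(sample)
```

As noted above, anything in the background that is itself solid black survives this step, so it cannot remove the small black box on its own.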

  3. Create a background mask and apply it to the image.

Here is the mask image

enter image description here

And this is the image with the mask applied and grey lines removed

Background Mask

However, blindly applying this mask produces some "white holes" in the captcha characters, and Tesseract still failed to recognize the text.
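The mask step can be sketched as follows, again on rows of grayscale values. `subtract_background` is a hypothetical helper, and the convention that dark mask pixels mark background is an assumption. The example also shows where the white holes come from: a character pixel that overlaps a background pixel is erased along with it.

```python
def subtract_background(rows, mask_rows, threshold=128):
    """Whiten every pixel that the mask marks as background.

    `mask_rows` has the same shape as `rows`; mask values below
    `threshold` are treated as background (an assumed convention).
    """
    return [
        [255 if m < threshold else p for p, m in zip(row, mask_row)]
        for row, mask_row in zip(rows, mask_rows)
    ]


# Pixel 0: character only. Pixel 1: background only. Pixel 2: both overlap.
image = [[0, 200, 0]]
mask = [[255, 0, 0]]
result = subtract_background(image, mask)  # pixel 2 becomes a "white hole"
```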

Are there any better methods for removing the static background?

Lastly, how could I split the filtered image into 6 images, each containing a single character? Thanks very much.

Winston

1 Answer

Here are a few ideas you could try.

After applying step 3, you could thicken the black strokes in the image using PIL so as to fill the white holes. I guess you are using python-tesseract; if so, please refer to Example 4 in https://code.google.com/p/python-tesseract/wiki/CodeSnippets
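One way to thicken the strokes is a minimum filter, where each pixel takes the darkest value in its neighbourhood. The sketch below is a plain-Python version on rows of grayscale values; `thicken_dark` and the radius of 1 are assumptions for illustration, and whether one pass is enough depends on how large the holes are.

```python
def thicken_dark(rows, radius=1):
    """Grow dark regions: each pixel becomes the minimum of its neighbourhood."""
    height, width = len(rows), len(rows[0])
    out = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Collect the (2*radius+1)-square window, clipped at the borders.
            window = [
                rows[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out_row.append(min(window))
        out.append(out_row)
    return out


# A 3x3 black ring with a white hole in the middle, on a 7x7 white canvas.
size = 7
img = [[255] * size for _ in range(size)]
for y in range(2, 5):
    for x in range(2, 5):
        if (y, x) != (3, 3):
            img[y][x] = 0
filled = thicken_dark(img)

# With PIL, a similar effect on a grayscale image:
#   from PIL import ImageFilter
#   img.filter(ImageFilter.MinFilter(3))
```

Note that this also thickens any leftover noise, so it is best applied after the background has been removed as far as possible.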

To extract the characters, you may refer to "Numpy PIL Python : crop image on whitespace" or "crop text with histogram Thresholds". These describe methods that analyse the histogram of the image to locate the whitespace, from which you can infer the character boundaries.
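The histogram idea can be sketched as a column projection: count the dark pixels in each column and treat runs of empty columns as gaps between characters. `split_columns` is a hypothetical helper, not code from the linked posts, and it assumes the characters do not touch each other horizontally.

```python
def split_columns(rows, threshold=128):
    """Return (start, end) column spans that contain dark pixels; end is exclusive."""
    width = len(rows[0])
    # Number of dark pixels in each column (the "histogram").
    ink = [sum(1 for row in rows if row[x] < threshold) for x in range(width)]

    spans = []
    start = None
    for x, count in enumerate(ink):
        if count and start is None:
            start = x                  # a character begins
        elif not count and start is not None:
            spans.append((start, x))   # a character ends at the first empty column
            start = None
    if start is not None:
        spans.append((start, width))   # character runs to the right edge
    return spans


# Two "characters": columns 1-2 and column 4, separated by empty columns.
rows = [[255, 0, 0, 255, 0, 255],
        [255, 0, 255, 255, 0, 255]]
spans = split_columns(rows)

# With PIL, each span could then be cut out with:
#   img.crop((x0, 0, x1, img.height))
```

If the result has fewer than 6 spans, two characters are probably touching, and a fixed-width split of the wide span may be a reasonable fallback.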

Paco Wong
  • Thanks Paco. I found that Tesseract is not good for OCR but I am using Scene OCR API server. It's good enough for my case. =) – Winston Feb 10 '15 at 07:43
  • For more details please check the example http://widu.tumblr.com/post/43624338495/ocr-of-an-image-from-a-link-using-python – Winston Feb 10 '15 at 07:43