I entered a captcha-ed website I would like to get rid of. Here is some sample images
Since the background is static and the word is so computer-generated non distorted character, I believe it is very do-able. Since passing the image directly to Tesseract (OCR engine) doesn't come a positive result. I would like to remove the captcha background before OCR.
I tried multiple background removal methods using Python-PIL
- Remove all non-black pixels, which remove the lines but it wouldn't remove the small solid black box.
- Apply filter mentioned another StackOverflow post, which would not remove the small solid black box. Also it is less effective than method 1.
Method 1 and 2 would give me a image like this
It seems close but Tesseract couldn't recognize the character, even after the top and bottom dot row is removed.
- Create a background mask, and apply the background mask to the image.
Here is the mask image
And this is the image with the mask applied and grey lines removed
However blindly applying this mask would generate some "white holes" in the captcha character. And still Tesseract failed to find out the words.
Are there any better methods removing the static background?
Lastly how could I split the filtered image into 6 image with single character? Thanks very much.