I'm trying to extract text from CAPTCHA images. The idea is to use lxml to get the image data from the form. The image data is prefixed with a header that defines the data type; I'm guessing the CAPTCHA picture is a PNG image encoded in Base64. The image data is decoded from Base64 back into its original binary form, and the binary data is then wrapped in BytesIO before it is passed to PIL's Image.open.
Here is the first part of the snippet:
import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib
from io import BytesIO
from PIL import Image
import pytesseract

def parse_form(html):
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

REGISTER_URL = 'http://tracuunnt.gdt.gov.vn/tcnnt/mstdn.jsp'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(REGISTER_URL).read()
form = parse_form(html)
The following function raises OSError: cannot identify image file <_io.BytesIO object at 0x08B3B060>:
def get_captcha(html):
    tree = lxml.html.fromstring(html)
    img_data = tree.cssselect('div img')[0].get('src')
    img_data = img_data.partition('-')[-1]
    binary_img_data = img_data.decode('base64')
    file_like = BytesIO(binary_img_data)
    img = Image.open(file_like)
    return img

img = get_captcha(html)
I suspect the problem is the binary_img_data variable. I've read up on decoding, encoding, the PIL docs, and binary data to find out how PIL can read a web-based image such as a CAPTCHA, but found nothing helpful.
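One way to test that suspicion without involving PIL at all might be to check whether the decoded bytes start with the PNG signature. The sketch below is only an illustration under assumptions: it assumes the src attribute is a data URI of the form data:image/png;base64,<payload> (which I have not confirmed for this page), and decode_data_uri, looks_like_png, and the sample string are hypothetical names and data, not part of my script.

```python
import base64

# Every PNG file begins with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_data_uri(src):
    """Split a data URI at the first comma and Base64-decode the payload.

    Assumes the form 'data:<mime>;base64,<payload>'.
    """
    header, sep, payload = src.partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("not a Base64 data URI: %r" % src[:30])
    return base64.b64decode(payload)


def looks_like_png(data):
    """Return True if the bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


# Hypothetical sample: the PNG signature plus filler, not a real image.
sample_bytes = PNG_SIGNATURE + b"not a real image"
sample_src = "data:image/png;base64," + base64.b64encode(sample_bytes).decode("ascii")

decoded = decode_data_uri(sample_src)
print(looks_like_png(decoded))

# For comparison: what is left after splitting the same string on '-'.
leftover = sample_src.partition("-")[-1]
print(repr(leftover), looks_like_png(leftover.encode("ascii")))
```

If looks_like_png returns False for the bytes extracted from the real page, the problem would be in how the string is split or decoded, before Image.open is ever reached.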