I'm trying to extract text from CAPTCHA pictures. The idea is to use lxml to get the image data from the form. The image data is prefixed with a header that defines the data type, and I'm guessing the CAPTCHA picture is a PNG image encoded in Base64. My code decodes the image data from Base64 back into its original binary form, wraps the binary data in BytesIO, and passes it to PIL's Image.open(). Here is the first part of the snippet.

import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib
from io import BytesIO
from PIL import Image
import pytesseract

def parse_form(html):
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

REGISTER_URL = 'http://tracuunnt.gdt.gov.vn/tcnnt/mstdn.jsp'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(REGISTER_URL).read()
form = parse_form(html)

The following function raises `OSError: cannot identify image file <_io.BytesIO object at 0x08B3B060>`:

def get_captcha(html):
    tree = lxml.html.fromstring(html)
    img_data = tree.cssselect('div img')[0].get('src')
    img_data = img_data.partition('-')[-1]
    binary_img_data = img_data.decode('base64')
    file_like = BytesIO(binary_img_data)
    img = Image.open(file_like)
    return img

img = get_captcha(html)

I suspect the problem is the binary_img_data variable. I've read up on decoding, encoding, the PIL docs, and binary data to find out how PIL can read a web-based image such as a CAPTCHA, but found nothing helpful.

Dzhud
  • Does this answer your question? ['str' object has no attribute 'decode'. Python 3 error?](https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error) – Ulrich Eckhardt Feb 20 '21 at 23:34

2 Answers

To decode the base64 string, try the following:

from base64 import b64decode

binary_img_data = b64decode(img_data)

The method your code uses (img_data.decode('base64')) was valid in Python 2, but will not work in Python 3.
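
If the `src` attribute really is an inline data URI, a minimal sketch of the decoding step could look like the following. It assumes the usual `data:image/png;base64,<payload>` layout, where the payload starts after the first comma; the `decode_data_uri` helper and the sample string are made up for illustration and are not taken from the page in the question.

```python
import base64

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_data_uri(src):
    """Return the raw bytes of a Base64 data URI (hypothetical helper)."""
    # Split "data:image/png;base64" from the payload at the first comma.
    header, sep, payload = src.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a Base64 data URI: %r" % src[:40])
    return base64.b64decode(payload)


# Made-up sample: a PNG signature followed by filler bytes, not a real image.
sample = "data:image/png;base64," + base64.b64encode(PNG_SIGNATURE + b"rest").decode("ascii")
raw = decode_data_uri(sample)
print(raw.startswith(PNG_SIGNATURE))
```

Checking the first bytes against the PNG signature before handing them to `Image.open()` is a quick way to tell whether the decoded data is an image at all. If `src` is an ordinary path or URL instead, the helper raises `ValueError`, which suggests the image has to be downloaded rather than decoded.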

Tadeusz Sznuk
  • Thank you but it gives the error `OSError: cannot identify image file <_io.BytesIO object at 0x0885B270>` – Dzhud Feb 21 '21 at 23:40

I had overlooked the solution at the start. Pillow couldn't read the image with that logic because the `src` attribute holds a path to the image, not Base64 data. So I fetched the image with requests.get(), whose response.content holds the image in binary form, and had Pillow open it on the fly through BytesIO().

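import lxml.html
import requests
from io import BytesIO
from PIL import Image

# html is the page source fetched earlier with opener.open(REGISTER_URL).read()
tree = lxml.html.fromstring(html)
# src is a path relative to the site root, not inline image data
img_data = tree.cssselect('div img')[0].get('src')
img_link = 'http://tracuunnt.gdt.gov.vn' + img_data
# response.content holds the raw image bytes
response = requests.get(img_link)
img = Image.open(BytesIO(response.content))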
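
A variation on the same idea, as a rough sketch using only the standard library: `urljoin` builds the absolute image URL from the page URL, and the download goes through a cookie-aware opener like the one in the question. Whether this site ties the CAPTCHA to the session cookie is an assumption on my part, though CAPTCHA images often are. The helper names and the `/tcnnt/captcha.png` path are hypothetical.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urljoin

PAGE_URL = "http://tracuunnt.gdt.gov.vn/tcnnt/mstdn.jsp"


def make_opener():
    """Build an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def absolute_img_url(page_url, src):
    """Resolve an <img> src against the URL of the page it appeared on."""
    return urljoin(page_url, src)


def fetch_bytes(opener, url):
    """Download url through opener, so cookies set by the form page are reused."""
    with opener.open(url) as response:
        return response.read()


# Hypothetical src value; the real path served by the site may differ.
print(absolute_img_url(PAGE_URL, "/tcnnt/captcha.png"))
```

To use it, create one opener with `make_opener()`, open the form page with it, then pass the bytes from `fetch_bytes(opener, absolute_img_url(PAGE_URL, src))` to `Image.open(BytesIO(...))`. Reusing a single opener keeps the form page and the image in the same session, which matters if the form is submitted later.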
Dzhud