I have a large number of base64-encoded PNG images, each showing a phone number (see http://www.trulia.com/profile/gerald-drexler-broker-neillsville-wi-10703037/overview for my working example; the string comes from the src attribute of the phone number image). I would like to run them through pytesseract to extract the numbers.
I took some guidance from the answers to "Loading Base64 String into Python Image Library".
I tried several formulations, and I can't seem to figure out how to load the string correctly into PIL to run pytesseract on it. Here's an example of an attempt:
from PIL import Image
import base64
import pytesseract
import cStringIO
imgstring = 'data: image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGcAAAAVCAYAAABbq/AzAAAACXBIWXMAAA7EAAAOxAGVKw4bAAADiUlEQVRoge3YTWgeVRQG4IcQJJQSQihBNIQiXYUSpJSgIF1IkSKllCJFQggutBQRQRfFH3RTRFwVERdBRHcuREVEupASRIP4C7VoFSmWSq2gtRGjNm21Ls79+k0mM3cmLrpxXhjm++aec973/sy55w4dOnTo0KFDhw4FLOIe3IWrmWtbwWcfLuDhinj7avzvTjyLDXp24F2cxx84gQMVdgfxHVbwFe5tqaEObXxynDCAQ/g2af+8hjM3ftcwIzoPQ5iouObEQG1INvP4PvlVBT+IYxVxhlJ7Vad6GMICHsQkNqWO/I7Zgt1s6twejGEvfhMLrI2GKjT5NHHCs6l/dyTtM8nnzkL/msbvGhbFTOdwDIfT7514GyN4ryb403glE+8JvN/AWcY8Xi/8P45HK+J+0FJDFZp8mjgHxRu1o2TzkH62aDN+YFS8tlMZQdP4S6yCMhZqgr8oVlAdtuFvDGdsyngHL6Xfw0L31pq4Qy00VCHn04bzlmSzsWQznmw2lJ7XjZ8BbMclfJ0R/CRexS8ZmzJGcb/+nnFcrJ6B1P4lroiJb8I4nhGD8FzhGZwu2Z5OHJtbaFiv7jacvTG6qWSzKdmMZbjX4D78kGnfisvYUtNeN/PDYr8Yxo2iCDiH5ws2ZxN/HY7ob8insKvQtj09Hyz5DOkXLm00rEd3G04iZR3Vn8xpkdKuWjtptW+O1HAyI/Y1vJFpzwYvYUakx17nTjb4Dor0MIH9YgPtpZxJ1emjl6YnGzTMWF2NzbTQ3ZZzRKTfc6KIOSoKiMu4oeRbO36DWK4g62GLGJTbM8LXg2/EKtuIpXRfzthfSe3LOJPub+EpkUr+ERNXTMkT6X6mQcObuLnw/NcWuttyLuGBUoxZkcovZXhWYQA/6efDMh7Hh/ikbcAGTCe+Jf38++M6/Iur7k98ZnWqk/5/oX7SexouJu7edbGF7v/KOYBH8HKGoxJjqqu1cVES5g5s1L+Wh3GbyNvjYm+5oH+QnEq8oxW+t4rSfkqkiDHsFmltvmC3R2zcu5LdbnHm2N9SQxWafJo4iYwwIhbTtCiXF1S/AI3bwqfWnnNeEAepJuRK6VOi4vlZnGmKK+4QPq6JuVmcZ86KBXI+2R6wtoNz4iS+Ik7txYFv0lCFNj45TuJguqL/ZeMxa/eaHhonZ07/C8H1wonE26EFPhKfSK4H9ia+Dh06dOjw/8G/sXcmUir28IcAAAAASUVORK5CYII='
imgstring = imgstring.split('base64,')[-1].strip()
pic = cStringIO.StringIO()
image_string = cStringIO.StringIO(base64.b64decode(imgstring))
image = Image.open(image_string)
image.save('pic.png', image.format, quality = 100)
picture = Image.open('pic.png', mode='r')
picture.load()
picture.seek(0)
print pytesseract.image_to_string(Image.open(picture))
It seems to me that I must be going about this the hard way, but even after saving the image to disk and loading it again, I still get AttributeError: read (full traceback below).
What's the most efficient way to load these into memory for pytesseract to chew them up? I haven't even gotten to the tesseract stage, and I have no idea how fast or slow it is, but I have millions of these to process.
Traceback (most recent call last):
  File "C:\Users\Jeff\Desktop\QS2\tess.py", line 16, in <module>
    print pytesseract.image_to_string(Image.open(picture))
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 2223, in open
    prefix = fp.read(16)
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 605, in __getattr__
    raise AttributeError(name)
AttributeError: read
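Reading the traceback, the failing call appears to be Image.open(picture): picture is already an Image object, not a path or a file-like object, so PIL looks for a read() method on it and fails. Below is a minimal sketch of what the in-memory route might look like, with no intermediate pic.png. It is written for Python 3 and assumes io.BytesIO in place of cStringIO; data_uri_to_bytes and ocr_data_uri are hypothetical helper names, not part of PIL or pytesseract.

```python
import base64
import io

# The fixed 8-byte header that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def data_uri_to_bytes(data_uri):
    """Drop the 'data:...;base64,' prefix (if any) and decode the payload."""
    payload = data_uri.split("base64,", 1)[-1].strip()
    return base64.b64decode(payload)


def ocr_data_uri(data_uri):
    """Decode a base64 PNG and OCR it in memory (hypothetical helper)."""
    # Imported here so the decoding helper above works without these packages.
    from PIL import Image
    import pytesseract

    raw = data_uri_to_bytes(data_uri)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not look like a PNG")

    # Image.open accepts a file-like object, so no temporary file is needed.
    image = Image.open(io.BytesIO(raw))

    # Pass the Image object itself; wrapping it in Image.open() again is
    # what the traceback above points to.
    return pytesseract.image_to_string(image)
```

With the original string this would be called as ocr_data_uri(imgstring). Whether this is fast enough for millions of images is untested here; it only removes the disk round trip.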