1

I have a whole bunch of image strings in base64 format for png images. They are phone numbers (see http://www.trulia.com/profile/gerald-drexler-broker-neillsville-wi-10703037/overview for my working example, using the src tag from the number). I would like to run them through pytesseract to extract the numbers.

I took some guidance from the answers here: Loading Base64 String into Python Image Library

I tried several formulations, and I can't seem to figure out how to load the string correctly into PIL to run pytesseract on it. Here's an example of an attempt:

from PIL import Image
import base64
import pytesseract
import cStringIO

imgstring = 'data: image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGcAAAAVCAYAAABbq/AzAAAACXBIWXMAAA7EAAAOxAGVKw4bAAADiUlEQVRoge3YTWgeVRQG4IcQJJQSQihBNIQiXYUSpJSgIF1IkSKllCJFQggutBQRQRfFH3RTRFwVERdBRHcuREVEupASRIP4C7VoFSmWSq2gtRGjNm21Ls79+k0mM3cmLrpxXhjm++aec973/sy55w4dOnTo0KFDhw4FLOIe3IWrmWtbwWcfLuDhinj7avzvTjyLDXp24F2cxx84gQMVdgfxHVbwFe5tqaEObXxynDCAQ/g2af+8hjM3ftcwIzoPQ5iouObEQG1INvP4PvlVBT+IYxVxhlJ7Vad6GMICHsQkNqWO/I7Zgt1s6twejGEvfhMLrI2GKjT5NHHCs6l/dyTtM8nnzkL/msbvGhbFTOdwDIfT7514GyN4ryb403glE+8JvN/AWcY8Xi/8P45HK+J+0FJDFZp8mjgHxRu1o2TzkH62aDN+YFS8tlMZQdP4S6yCMhZqgr8oVlAdtuFvDGdsyngHL6Xfw0L31pq4Qy00VCHn04bzlmSzsWQznmw2lJ7XjZ8BbMclfJ0R/CRexS8ZmzJGcb/+nnFcrJ6B1P4lroiJb8I4nhGD8FzhGZwu2Z5OHJtbaFiv7jacvTG6qWSzKdmMZbjX4D78kGnfisvYUtNeN/PDYr8Yxo2iCDiH5ws2ZxN/HY7ob8insKvQtj09Hyz5DOkXLm00rEd3G04iZR3Vn8xpkdKuWjtptW+O1HAyI/Y1vJFpzwYvYUakx17nTjb4Dor0MIH9YgPtpZxJ1emjl6YnGzTMWF2NzbTQ3ZZzRKTfc6KIOSoKiMu4oeRbO36DWK4g62GLGJTbM8LXg2/EKtuIpXRfzthfSe3LOJPub+EpkUr+ERNXTMkT6X6mQcObuLnw/NcWuttyLuGBUoxZkcovZXhWYQA/6efDMh7Hh/ikbcAGTCe+Jf38++M6/Iur7k98ZnWqk/5/oX7SexouJu7edbGF7v/KOYBH8HKGoxJjqqu1cVES5g5s1L+Wh3GbyNvjYm+5oH+QnEq8oxW+t4rSfkqkiDHsFmltvmC3R2zcu5LdbnHm2N9SQxWafJo4iYwwIhbTtCiXF1S/AI3bwqfWnnNeEAepJuRK6VOi4vlZnGmKK+4QPq6JuVmcZ86KBXI+2R6wtoNz4iS+Ik7txYFv0lCFNj45TuJguqL/ZeMxa/eaHhonZ07/C8H1wonE26EFPhKfSK4H9ia+Dh06dOjw/8G/sXcmUir28IcAAAAASUVORK5CYII='
imgstring = imgstring.split('base64,')[-1].strip()
pic = cStringIO.StringIO()
image_string = cStringIO.StringIO(base64.b64decode(imgstring))
image = Image.open(image_string)
image.save('pic.png', image.format, quality = 100)
picture = Image.open('pic.png', mode='r')
picture.load()
picture.seek(0)

print pytesseract.image_to_string(Image.open(picture))

It seems to me that I must be going about this the hard way, but even after saving, loading, etc., I still get an AttributeError: read

What's the most efficient way to load these into memory for pytesseract to chew them up? I haven't even gotten to the tesseract stage, and I have no idea how fast or slow it is, but I have millions of these to process.

Traceback (most recent call last):
  File "C:\Users\Jeff\Desktop\QS2\tess.py", line 16, in <module>
    print pytesseract.image_to_string(Image.open(picture))
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 2223, in open
    prefix = fp.read(16)
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 605, in __getattr__
    raise AttributeError(name)
AttributeError: read
Community
  • 1
  • 1
Xodarap777
  • 1,358
  • 4
  • 19
  • 42
  • Could you give us the full traceback? I currently have no idea what is causing the attribute error. – IronManMark20 Jun 27 '15 at 03:53
  • Traceback added. Is that to say that this code works for you? I've heard that PIL has issues in its Windows installations, would be interesting if that's the issue, here. – Xodarap777 Jun 27 '15 at 04:21
  • Isn't `picture` already an PIL.Image object? The argument to Image.open() is meant to be a file object or string. Perhaps change line 16 to `print pytesseract.image_to_string(picture)`. – anjsimmo Jun 27 '15 at 04:42
  • I tried that first - in that one I get `WindowsError: [Error 2] The system cannot find the file specified` - full traceback is too long. The advantage of saving the file and opening it is that I can be sure that the file exists and looks fine. The pic.png file is indeed a phone number png in the correct directory. – Xodarap777 Jun 27 '15 at 04:50
  • Pytesseract writes a temporary image to disk before calling tesseract-ocr in [pytesseract.py](https://github.com/madmaze/pytesseract/blob/master/src/pytesseract.py#L128). If efficiency is an issue, you may have more luck with something like https://code.google.com/p/python-tesseract/ which claims not to write any temporary files. – anjsimmo Jun 30 '15 at 01:51

2 Answers2

3

Issue for Python 3.*

from PIL import Image
import base64
import pytesseract
import io

imgstring = 'data: image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGcAAAAVCAYAAABbq/AzAAAACXBIWXMAAA7EAAAOxAGVKw4bAAADiUlEQVRoge3YTWgeVRQG4IcQJJQSQihBNIQiXYUSpJSgIF1IkSKllCJFQggutBQRQRfFH3RTRFwVERdBRHcuREVEupASRIP4C7VoFSmWSq2gtRGjNm21Ls79+k0mM3cmLrpxXhjm++aec973/sy55w4dOnTo0KFDhw4FLOIe3IWrmWtbwWcfLuDhinj7avzvTjyLDXp24F2cxx84gQMVdgfxHVbwFe5tqaEObXxynDCAQ/g2af+8hjM3ftcwIzoPQ5iouObEQG1INvP4PvlVBT+IYxVxhlJ7Vad6GMICHsQkNqWO/I7Zgt1s6twejGEvfhMLrI2GKjT5NHHCs6l/dyTtM8nnzkL/msbvGhbFTOdwDIfT7514GyN4ryb403glE+8JvN/AWcY8Xi/8P45HK+J+0FJDFZp8mjgHxRu1o2TzkH62aDN+YFS8tlMZQdP4S6yCMhZqgr8oVlAdtuFvDGdsyngHL6Xfw0L31pq4Qy00VCHn04bzlmSzsWQznmw2lJ7XjZ8BbMclfJ0R/CRexS8ZmzJGcb/+nnFcrJ6B1P4lroiJb8I4nhGD8FzhGZwu2Z5OHJtbaFiv7jacvTG6qWSzKdmMZbjX4D78kGnfisvYUtNeN/PDYr8Yxo2iCDiH5ws2ZxN/HY7ob8insKvQtj09Hyz5DOkXLm00rEd3G04iZR3Vn8xpkdKuWjtptW+O1HAyI/Y1vJFpzwYvYUakx17nTjb4Dor0MIH9YgPtpZxJ1emjl6YnGzTMWF2NzbTQ3ZZzRKTfc6KIOSoKiMu4oeRbO36DWK4g62GLGJTbM8LXg2/EKtuIpXRfzthfSe3LOJPub+EpkUr+ERNXTMkT6X6mQcObuLnw/NcWuttyLuGBUoxZkcovZXhWYQA/6efDMh7Hh/ikbcAGTCe+Jf38++M6/Iur7k98ZnWqk/5/oX7SexouJu7edbGF7v/KOYBH8HKGoxJjqqu1cVES5g5s1L+Wh3GbyNvjYm+5oH+QnEq8oxW+t4rSfkqkiDHsFmltvmC3R2zcu5LdbnHm2N9SQxWafJo4iYwwIhbTtCiXF1S/AI3bwqfWnnNeEAepJuRK6VOi4vlZnGmKK+4QPq6JuVmcZ86KBXI+2R6wtoNz4iS+Ik7txYFv0lCFNj45TuJguqL/ZeMxa/eaHhonZ07/C8H1wonE26EFPhKfSK4H9ia+Dh06dOjw/8G/sXcmUir28IcAAAAASUVORK5CYII='
imgstring = imgstring.split('base64,')[-1].strip()
pic = io.StringIO()
image_string = io.BytesIO(base64.b64decode(imgstring))
image = Image.open(image_string)

# Overlay on white background, see http://stackoverflow.com/a/7911663/1703216
bg = Image.new("RGB", image.size, (255,255,255))
bg.paste(image,image)

print(pytesseract.image_to_string(bg))

# Save the image passed to pytesseract for debugging purposes
bg.save('pic.png')
2

The PNG transparency seems to be causing issues. Overlaying on a white background fixes the issue.

from PIL import Image
import base64
import pytesseract
import cStringIO

imgstring = 'data: image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGcAAAAVCAYAAABbq/AzAAAACXBIWXMAAA7EAAAOxAGVKw4bAAADiUlEQVRoge3YTWgeVRQG4IcQJJQSQihBNIQiXYUSpJSgIF1IkSKllCJFQggutBQRQRfFH3RTRFwVERdBRHcuREVEupASRIP4C7VoFSmWSq2gtRGjNm21Ls79+k0mM3cmLrpxXhjm++aec973/sy55w4dOnTo0KFDhw4FLOIe3IWrmWtbwWcfLuDhinj7avzvTjyLDXp24F2cxx84gQMVdgfxHVbwFe5tqaEObXxynDCAQ/g2af+8hjM3ftcwIzoPQ5iouObEQG1INvP4PvlVBT+IYxVxhlJ7Vad6GMICHsQkNqWO/I7Zgt1s6twejGEvfhMLrI2GKjT5NHHCs6l/dyTtM8nnzkL/msbvGhbFTOdwDIfT7514GyN4ryb403glE+8JvN/AWcY8Xi/8P45HK+J+0FJDFZp8mjgHxRu1o2TzkH62aDN+YFS8tlMZQdP4S6yCMhZqgr8oVlAdtuFvDGdsyngHL6Xfw0L31pq4Qy00VCHn04bzlmSzsWQznmw2lJ7XjZ8BbMclfJ0R/CRexS8ZmzJGcb/+nnFcrJ6B1P4lroiJb8I4nhGD8FzhGZwu2Z5OHJtbaFiv7jacvTG6qWSzKdmMZbjX4D78kGnfisvYUtNeN/PDYr8Yxo2iCDiH5ws2ZxN/HY7ob8insKvQtj09Hyz5DOkXLm00rEd3G04iZR3Vn8xpkdKuWjtptW+O1HAyI/Y1vJFpzwYvYUakx17nTjb4Dor0MIH9YgPtpZxJ1emjl6YnGzTMWF2NzbTQ3ZZzRKTfc6KIOSoKiMu4oeRbO36DWK4g62GLGJTbM8LXg2/EKtuIpXRfzthfSe3LOJPub+EpkUr+ERNXTMkT6X6mQcObuLnw/NcWuttyLuGBUoxZkcovZXhWYQA/6efDMh7Hh/ikbcAGTCe+Jf38++M6/Iur7k98ZnWqk/5/oX7SexouJu7edbGF7v/KOYBH8HKGoxJjqqu1cVES5g5s1L+Wh3GbyNvjYm+5oH+QnEq8oxW+t4rSfkqkiDHsFmltvmC3R2zcu5LdbnHm2N9SQxWafJo4iYwwIhbTtCiXF1S/AI3bwqfWnnNeEAepJuRK6VOi4vlZnGmKK+4QPq6JuVmcZ86KBXI+2R6wtoNz4iS+Ik7txYFv0lCFNj45TuJguqL/ZeMxa/eaHhonZ07/C8H1wonE26EFPhKfSK4H9ia+Dh06dOjw/8G/sXcmUir28IcAAAAASUVORK5CYII='
imgstring = imgstring.split('base64,')[-1].strip()
pic = cStringIO.StringIO()
image_string = cStringIO.StringIO(base64.b64decode(imgstring))
image = Image.open(image_string)

# Overlay on white background, see http://stackoverflow.com/a/7911663/1703216
bg = Image.new("RGB", image.size, (255,255,255))
bg.paste(image,image)

print pytesseract.image_to_string(bg)

# Save the image passed to pytesseract for debugging purposes
bg.save('pic.png')
anjsimmo
  • 704
  • 5
  • 18
  • Pasting this exactly, I still get `WindowsError: [Error 2] The system cannot find the file specified` – Xodarap777 Jun 28 '15 at 23:55
  • 1
    I tested it on Ubuntu. I'd recommend testing that the basic usage examples for [tesseract-ocr](https://code.google.com/p/tesseract-ocr/wiki/ReadMe#Running_Tesseract) and [pyteseract](https://github.com/madmaze/pytesseract/blob/master/README.md) work first. Pytesseract is just a thin wrapper around tesseract-ocr, so won't work without tesseract-ocr. – anjsimmo Jun 30 '15 at 01:40
  • I failed to properly add the tesseract-ocr program file to the system's path variable. – Xodarap777 Jul 24 '15 at 04:55