Tesseract OCR on AWS Lambda via virtualenv

Question

I have spent all week attempting this, so this is a bit of a hail mary.

I am attempting to package Tesseract OCR for AWS Lambda running on Python (I am also using Pillow for image pre-processing, hence the choice of Python).

I understand how to deploy Python packages onto AWS using virtualenv, however I cannot seem to find a way of deploying the actual Tesseract OCR into the environment (e.g. /env/)

Doing pip install py-tesseract successfully deploys the Python wrapper into /env/, but the wrapper relies on a separate (local) install of Tesseract.
Doing pip install tesseract-ocr gets only so far before it errors out as follows, which I assume is due to a missing leptonica dependency. However, I have no idea how to package leptonica into /env/ (if that is even possible):

tesseract_ocr.cpp:264:10: fatal error: 'leptonica/allheaders.h' file not found
#include "leptonica/allheaders.h"

Downloading 0.9.1 python-tesseract egg file from https://bitbucket.org/3togo/python-tesseract/downloads and doing easy_install also errors out when looking for dependencies

Processing dependencies for python-tesseract==0.9.1
Searching for python-tesseract==0.9.1
Reading https://pypi.python.org/simple/python-tesseract/
Couldn't find index page for 'python-tesseract' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for python-tesseract==0.9.1

Any pointers would be greatly appreciated.

José Augusto Paiva · Accepted Answer · 2018-11-27T13:05:00.440

It's not working because these Python packages are only wrappers around tesseract. You have to compile tesseract on an Amazon Linux instance and copy the binaries and libraries into the zip file of the Lambda function.
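Since the pip packages only wrap the tesseract executable, a quick check of whether a binary is reachable at all can save some guesswork. This is a minimal sketch, assuming Python 3; the helper name find_tesseract and its extra_dirs parameter are hypothetical and only for illustration.

```python
import os
import shutil


def find_tesseract(extra_dirs=()):
    """Return the path of a tesseract executable, or None if none is found.

    extra_dirs (hypothetical parameter) lists directories to search before
    the regular PATH, e.g. the directory the Lambda package is unpacked to.
    """
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get("PATH", "")])
    return shutil.which("tesseract", path=search_path)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract binary not found; the Python wrapper alone cannot run OCR")
    else:
        print("tesseract found at", location)
```

If this prints "not found" inside the deployment environment, installing more Python packages will not help; the binary itself has to be shipped, which is what the steps below do.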

1) Start an EC2 instance with 64-bit Amazon Linux;

2) Install dependencies:

sudo yum install gcc gcc-c++ make
sudo yum install autoconf automake
sudo yum install libtool
sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

3) Compile and install leptonica:

cd ~
mkdir leptonica
cd leptonica
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make
sudo make install

4) Compile and install tesseract

cd ~
mkdir tesseract
cd tesseract
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd tesseract-3.04.01
./autogen.sh
./configure
make
sudo make install

5) Download language traineddata to tessdata

cd /usr/local/share/tessdata
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/

At this point you should be able to use tesseract on this EC2 instance. To use it in a Lambda function, copy the tesseract binary and its libraries from this instance into the zip file you upload to Lambda. The commands below produce a zip file with all the files you need.

6) Zip all the stuff you need to run tesseract on lambda

cd ~
mkdir tesseract-lambda
cd tesseract-lambda
cp /usr/local/bin/tesseract .
mkdir lib
cd lib
cp /usr/local/lib/libtesseract.so.3 .
cp /usr/local/lib/liblept.so.5 .
cp /usr/lib64/libpng12.so.0 .
cd ..

mkdir tessdata
cd tessdata
cp /usr/local/share/tessdata/eng.traineddata .
cd ..

# zip the contents (not the folder itself) so tesseract, lib/ and tessdata/ sit at the root of the archive
zip -r ~/tesseract-lambda.zip .
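
The handler in step 7 looks for the tesseract binary in its own directory, so the files need to sit at the top level of the archive rather than inside a tesseract-lambda/ folder. As an alternative to the zip command, here is a minimal Python 3 sketch that archives the contents of the folder; the function name zip_directory_contents is hypothetical, and the paths in the usage part are the ones used in the steps above.

```python
import os
import zipfile


def zip_directory_contents(source_dir, zip_path):
    """Zip the *contents* of source_dir so they sit at the root of the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(folder, name)
                # Path inside the archive, relative to source_dir (no parent folder).
                arcname = os.path.relpath(full_path, source_dir)
                # ZipFile.write records the file's permission bits, so the
                # tesseract binary stays executable after extraction on Unix.
                archive.write(full_path, arcname)


if __name__ == "__main__":
    home = os.path.expanduser("~")
    zip_directory_contents(
        os.path.join(home, "tesseract-lambda"),
        os.path.join(home, "tesseract-lambda.zip"),
    )
```

Listing the archive afterwards (for example with unzip -l) should show tesseract, lib/ and tessdata/ without a leading folder name.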

The tesseract-lambda.zip file has everything Lambda needs to run tesseract. The last thing to do is add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but it should work.

7) Create a file named main.py, write a lambda function like the one below and add it to the root of tesseract-lambda.zip:

import os
import subprocess
from urllib.parse import unquote_plus

import boto3

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')

s3 = boto3.client('s3')

def lambda_handler(event, context):

    # Get the bucket and object key from the S3 event (keys arrive URL-encoded)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    try:
        print("Bucket: " + bucket)
        print("Key: " + key)

        imgfilepath = '/tmp/image.png'
        # tesseract appends ".txt" to the output base name by itself
        outputbase = '/tmp/result'
        textfilepath = outputbase + '.txt'
        exportfile = key + '.txt'

        print("Export: " + exportfile)

        s3.download_file(bucket, key, imgfilepath)

        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            imgfilepath,
            outputbase,
        )

        try:
            # Capture stderr as well, since tesseract reports errors there
            output = subprocess.check_output(command, shell=True, stderr=subprocess.STDOUT)
            print(output.decode('utf-8', 'replace'))
            s3.upload_file(textfilepath, bucket, exportfile)
        except subprocess.CalledProcessError as e:
            print(e.output)

    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise

When creating the AWS Lambda function on the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
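
The handler above assembles a shell command string. As an alternative, here is a minimal Python 3 sketch that passes an argument list and an environment dictionary to subprocess, which avoids shell quoting issues. The function names are hypothetical. LD_LIBRARY_PATH is left untouched here, on the assumption (taken from a comment further down) that Lambda already includes /var/task/lib.

```python
import os
import subprocess


def build_tesseract_call(bundle_dir, image_path, output_base):
    """Return (argv, env) for running the bundled tesseract binary."""
    argv = [os.path.join(bundle_dir, "tesseract"), image_path, output_base]
    env = dict(os.environ)
    # Same variable as in the answer: the directory that contains tessdata/.
    env["TESSDATA_PREFIX"] = bundle_dir
    return argv, env


def run_tesseract(bundle_dir, image_path, output_base):
    """Run tesseract and return the path of the text file it writes."""
    argv, env = build_tesseract_call(bundle_dir, image_path, output_base)
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(argv, env=env, check=True,
                   stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # tesseract appends ".txt" to the output base name.
    return output_base + ".txt"
```

Inside the handler this could be called as run_tesseract(SCRIPT_DIR, '/tmp/image.png', '/tmp/result'), with the returned path handed to s3.upload_file.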

IMPORTANT

From time to time, AWS Lambda's environment changes. For example, the current image for the Lambda environment is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts to fail with a segmentation fault, run "ldd tesseract" in the Lambda environment and check the output to see which libs are needed (currently libtesseract.so.3, liblept.so.5 and libpng12.so.0).
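
The ldd check can also be scripted. The sketch below is a minimal Python 3 example that runs ldd on a binary and collects the libraries reported as "not found". The function names are hypothetical, and the parsing assumes the usual "name => not found" line format of ldd on Linux.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "not found" in line:
            # Lines look like: "libtesseract.so.3 => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return missing_libraries(result.stdout)
```

Any library that check_binary returns for the tesseract binary is a candidate for the lib folder of the zip file.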

Thanks for the comment, SergioArcos.

Is something like this what I would need to do to get the ODBC driver to be found when trying to use pypyodbc to connect to an RDS instance of SQL Server? pypyODBC raises an error when it can't find an ODBC library (driver?). — Ryan Jones, Sep 29 '16 at 17:14
Probably. I never used pyODBC, that's why I'm not really sure, but, what I usually do when trying to run any software with library dependencies on AWS Lambda is: - Setup the library on an Amazon Linux machine (EC2); - Then, I copy all the library files that are needed to the lib folder of the lambda function (you can use **lld** command to have a list of shared object dependencies); — José Augusto Paiva, Oct 10 '16 at 18:41
Small correction to Jose's answer: the tool to get a list of shared dependencies is **ldd** — Dmitry Kolomiets, Jan 12 '17 at 12:58
In the meantime, the training data in the master branch is for tesseract 4.0, which doesn't work with 3.04. For 3.04 you need to get the training data from https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata — ssindelar, Mar 25 '17 at 06:07
Anyone else getting a similar error to `errorMessage": "Command 'LD_LIBRARY_PATH=/var/task/lib TESSDATA_PREFIX=/var/task /var/task/tesseract -v' returned non-zero exit status -11`. I tried printing the output of `tesseract -v` instead of OCRing any real files, runs fine on EC2. — AsianYayaToure, May 10 '17 at 01:29
I would check the tesseract binary you are uploading for your lambda function. Does it run fine on EC2 on an 64-bit Amazon Linux instance ? — José Augusto Paiva, May 10 '17 at 02:11
I might be a bit late. Lambda already has /var/task/lib in LD_LIBRARY_PATH, so it is not necessary to set it, and overriding it prevents other libraries that might be needed from being found. — Gabriel, Jun 20 '17 at 06:20
I wanted to use this tool with the latest Lambda image (amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2), but it returned Segmentation fault, so I had to fix it: (1) clone repo (2) `$ LD_LIBRARY_PATH=./ ldd tesseract` to find "not found" libraries (3) ONLY these libraries are required: liblept.so.5, libpng12.so.0 and libtesseract.so.3 (4) It worked again! No need to copy any other one! — SergioArcos, Sep 26 '17 at 22:18
@SergioArcos Thanks! Your comment helped me with the same problem. I edited the answer. — José Augusto Paiva, Dec 01 '17 at 18:00
If you want to create searchable PDFs you also need to copy some other files from tessdata. It is easiest to copy the whole tessdata directory with `cp -r /usr/local/share/tessdata .` — hansaplast, Jan 21 '18 at 13:43
Nice answer, worked well. At first I executed the commands one by one and didn't succeed. Then I put them in a script, ran it in one go, and was able to build the correct libs. I used the docker image `dacut/amazon-linux-python-3.6` to build the libs. — knownUnknown, Jan 06 '19 at 08:38
@José Augusto Paiva How can I do it from docker container? I am using amazon/aws-lambda-provided:al2 image. What all files should be present to the lambda function if I run my scripts in `function.sh.handler`. — Akash Tadwai, Jul 02 '21 at 16:55
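
Following Gabriel's comment above, one option is to extend LD_LIBRARY_PATH instead of replacing it, so the directories that Lambda already provides remain searchable. This is a minimal Python 3 sketch; the helper name with_library_dir is hypothetical.

```python
import os


def with_library_dir(env, lib_dir):
    """Return a copy of env with lib_dir prepended to LD_LIBRARY_PATH."""
    updated = dict(env)
    existing = updated.get("LD_LIBRARY_PATH", "")
    # Drop empty entries so a missing variable does not produce a stray separator.
    parts = [p for p in existing.split(os.pathsep) if p]
    if lib_dir not in parts:
        parts.insert(0, lib_dir)
    updated["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return updated
```

The result can be passed as the env argument of subprocess.check_output or subprocess.run, for example with_library_dir(os.environ, LIB_DIR).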

hansaplast · Answer 2 · 2018-03-23T05:51:58.630

Adaptations for tesseract 4:

Tesseract offers many improvements in version 4, thanks to a neural network. I've tried it with some scans and the improvements are quite substantial. Plus the whole package was 25% smaller in my case. Planned release date of version 4 is first half of 2018.

The build steps are similar to tesseract 3 with some tweaks, which is why I wanted to share them in full. I also made a GitHub repo with ready-made binary files (most of it is based on Jose's post above, which was very helpful), plus a blog post on how to use it as a processing step after a Raspberry Pi 3 powered scanner step.

To compile the tesseract4 binaries, do these steps on a fresh 64-bit AWS AMI instance:

Compile leptonica

cd ~
sudo yum install clang -y
sudo yum install libpng-devel libtiff-devel zlib-devel libwebp-devel libjpeg-turbo-devel -y
wget https://github.com/DanBloomberg/leptonica/releases/download/1.75.1/leptonica-1.75.1.tar.gz
tar -xzvf leptonica-1.75.1.tar.gz
cd leptonica-1.75.1
./configure && make && sudo make install

Compile autoconf-archive

Unfortunately, for the past few weeks tesseract has needed autoconf-archive, which is not available for Amazon AMIs, so you need to compile it on your own:

cd ~
wget http://mirror.switch.ch/ftp/mirror/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz
tar -xvf autoconf-archive-2017.09.28.tar.xz
cd autoconf-archive-2017.09.28
./configure && make && sudo make install
sudo cp m4/* /usr/share/aclocal/

Compile tesseract

cd ~
sudo yum install git-core libtool pkgconfig -y
git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
cd tesseract-ocr
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure
make
sudo make install

Get all needed files and zip

cd ~
mkdir tesseract-standalone
cd tesseract-standalone
cp /usr/local/bin/tesseract .
mkdir lib
cp /usr/local/lib/libtesseract.so.4 lib/
cp /usr/local/lib/liblept.so.5 lib/
cp /usr/lib64/libjpeg.so.62 lib/
cp /usr/lib64/libwebp.so.4 lib/
cp /usr/lib64/libstdc++.so.6 lib/
mkdir tessdata
cd tessdata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
# additionally any other language you want to use, e.g. `deu` for Deutsch
mkdir configs
cp /usr/local/share/tessdata/configs/pdf configs/
cp /usr/local/share/tessdata/pdf.ttf .
cd ..
zip -r ~/tesseract-standalone.zip *
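
Before uploading, it can help to confirm that nothing from the copy steps above is missing. This is a minimal Python 3 sketch; the function name missing_bundle_files is hypothetical, and the file list mirrors the cp and wget commands in this answer.

```python
import os

# Files gathered in the steps above (Tesseract 4 build).
REQUIRED_FILES = [
    "tesseract",
    "lib/libtesseract.so.4",
    "lib/liblept.so.5",
    "lib/libjpeg.so.62",
    "lib/libwebp.so.4",
    "lib/libstdc++.so.6",
    "tessdata/osd.traineddata",
    "tessdata/eng.traineddata",
    "tessdata/configs/pdf",
    "tessdata/pdf.ttf",
]


def missing_bundle_files(bundle_dir, required=REQUIRED_FILES):
    """Return the required files that are absent from bundle_dir."""
    return [rel for rel in required
            if not os.path.isfile(os.path.join(bundle_dir, *rel.split("/")))]


if __name__ == "__main__":
    bundle = os.path.expanduser("~/tesseract-standalone")
    for rel in missing_bundle_files(bundle):
        print("missing:", rel)
```

An empty result only means the files are present; whether the libraries match the Lambda environment still has to be checked with ldd.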

http://babyname.tips/mirrors/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz returned 404, I downloaded from: http://ftpmirror.gnu.org/autoconf-archive/ — Jan Giacomelli, Mar 22 '18 at 22:42
@CharlieChen I start tesseract with `tesseract -l deu in.pdf out pdf`. For that it needs the pdf configuration (because the last argument is `pdf`) which is basically a properties file which says how the OCR transformation should be done, also it needs the language files (in this case `deu` for german) for better OCR quality — hansaplast, May 20 '18 at 06:36
I have followed the steps above for tesseract 4 and it works flawlessly on an EC2 instance. After that, I gathered all the files as above and deployed to AWS Lambda (I have done this before with other AWS Linux compiles, e.g., opencv2, with no issues). But now, when trying to run my function, I keep getting the error 'pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path'. Any hints here? — Pedro Sousa, Nov 06 '18 at 15:41
I'm trying to run this, but getting an error libpng15.so.15: cannot open shared object file does anyone know how to install it? (using libtesseract.so.5) — user3701979, Sep 25 '19 at 15:45

score 4 · Answer 3 · answered Mar 09 '20 at 20:19

Generate zip files using shell scripts to compile Tesseract 4 for Python 3.7

I struggled with this issue for a few days, trying to get Tesseract 4 to work in a Python 3.7 Lambda function. Finally I found this article and GitHub repository, which describe how to build the zip files for tesseract, pytesseract, opencv, and pillow with shell scripts that run Docker images on EC2. The process takes less than 20 minutes using these steps and is reliably reproducible.

Summarized Steps:

Start an Amazon Linux EC2 instance (t2.micro will do just fine)

sudo yum update
sudo yum install git-core -y
sudo yum install docker -y
sudo service docker start
sudo usermod -a -G docker ec2-user #allows ec2-user to call docker

After running the 5th command you will need to logout and log back in for the change to take effect.

git clone https://github.com/amtam0/lambda-tesseract-api.git
cd lambda-tesseract-api/
bash build_tesseract4.sh #takes a few minutes
bash build_py37_pkgs.sh

This will generate .zip files for tesseract, pytesseract, pillow, and opencv. To use them with Lambda, you need to complete two more steps.

Create Lambda layers, one for each zip file, and attach the layers to your Lambda function.
Create an Environment Variable. Key : PYTHONPATH and Value : /opt/

(Note: you will probably need to increase your Memory allocation and Timeout)

At this point you are all set to upload your code and start using Tesseract on AWS Lambda! Refer back to the Medium article for a test script.
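
Creating the layers can also be scripted instead of done in the console. The sketch below is a minimal Python 3 example using boto3's publish_layer_version. The function names, layer name and zip path are hypothetical, and it assumes each zip is small enough for a direct upload (larger archives would have to go through S3 instead). Lambda extracts layers under /opt, which is why PYTHONPATH is set to /opt/ above.

```python
def layer_request(layer_name, zip_path, runtime="python3.7"):
    """Build the keyword arguments for Lambda's publish_layer_version call."""
    with open(zip_path, "rb") as handle:
        payload = handle.read()
    return {
        "LayerName": layer_name,
        "Content": {"ZipFile": payload},
        "CompatibleRuntimes": [runtime],
    }


def publish_layer(layer_name, zip_path):
    """Publish one zip file as a layer and return the new layer version ARN."""
    import boto3  # imported here so layer_request works without boto3 installed

    client = boto3.client("lambda")
    response = client.publish_layer_version(**layer_request(layer_name, zip_path))
    return response["LayerVersionArn"]
```

Calling publish_layer once per generated zip file returns the ARNs to attach to the function.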

Just a heads up: this really does work and it's pretty straightforward! — adriaanbd, Aug 10 '20 at 00:10
Thank you for sharing, it works for me as well. Just a useful note: I tried Ubuntu 18.04 and Amazon Linux EC2 instances and ran into problems executing the bash scripts. It only works on Ubuntu 16.04! — David López, Mar 27 '21 at 02:53

HazimoRa3d · Answer 4 · score 3 · 2020-03-01T19:48:42.000

Check this Medium article on how to set up Tesseract 4.0.0 in Lambda using Docker. It also shows how to convert Python packages into layers.

edited Mar 01 '20 at 19:48

answered Feb 29 '20 at 16:49

Thank you for sharing, it works for me as well. Just a useful note: I tried Ubuntu 18.04 and Amazon Linux EC2 instances and ran into problems executing the bash scripts. It only works on Ubuntu 16.04! – David López Mar 27 '21 at 02:54
Can you please tell me how to do this in Java? – abc123 Feb 14 '22 at 05:26

score 1 · Answer 5 · answered Jan 01 '23 at 18:04

Note that wget http://www.leptonica.com/source/leptonica-1.73.tar.gz does not work. They've moved to leptonica.org, so use wget http://www.leptonica.org/source/leptonica-1.83.0.tar.gz

Perry
