Tesseract OCR on AWS Lambda via virtualenv

I used this link (scroll to "Adaptations for tesseract 4") to build the tesseract executable and its dependency libraries. I zipped everything and uploaded it to S3.

My Lambda function downloads this zip and extracts the dependencies into the /tmp folder. I now want to use these dependencies from the same Lambda function (Python 3 runtime).
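For reference, the download-and-extract step can be sketched as below. This is a minimal sketch, not the exact code from my function: the helper names are made up for illustration, and the bucket, key and paths passed in are placeholders. One thing worth noting is that Python's zipfile module does not restore Unix permission bits when extracting, so a binary unpacked this way may lose its executable bit. Whether that is what happens here is only an assumption, but it is cheap to rule out by restoring the mode stored in the archive.

```python
import os
import stat
import zipfile


def download_bundle(bucket, key, local_zip):
    """Fetch the zip from S3 (bucket and key are supplied by the caller)."""
    import boto3  # imported lazily so the extraction helper works without it

    boto3.client("s3").download_file(bucket, key, local_zip)


def extract_with_permissions(local_zip, dest_dir):
    """Extract the archive and restore the Unix mode bits stored in it."""
    with zipfile.ZipFile(local_zip) as archive:
        for info in archive.infolist():
            extracted = archive.extract(info, dest_dir)
            # The upper 16 bits of external_attr hold the Unix mode when the
            # archive was created on a Unix-like system; 0 means "not recorded".
            mode = stat.S_IMODE(info.external_attr >> 16)
            if mode and not info.is_dir():
                os.chmod(extracted, mode)
```

In the handler this would be called as something like download_bundle("my-bucket", "tesseract.zip", "/tmp/tesseract.zip") followed by extract_with_permissions("/tmp/tesseract.zip", "/tmp"), where the bucket and key names are hypothetical.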

I am getting this error

Response:
{
  "errorMessage": "tesseract is not installed or it's not in your path",
  "errorType": "TesseractNotFoundError",

I believe this happens because the environment variables are not set. I have tried to set them but cannot get past this error.

import os
import sys

# Setting the modules path
sys.path.insert(0, '/tmp/')
import boto3
import cv2
import numpy as np
import subprocess
# Make the extracted binary and its data files visible before importing pytesseract
os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['TESSDATA_PREFIX'] = "/tmp/tessdata/"
import pytesseract

I set the environment variables like this inside the Lambda function, but I still get the same error. I also tried setting the variables as shown in the image below, with no luck.

[Image: environment variable settings for the Lambda function]
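To narrow this down, a small check like the following could be run inside the function. It is only a sketch: build_tesseract_env and locate_tesseract are hypothetical helper names, and the directories passed in are assumed to match where the zip was extracted. Since pytesseract runs tesseract as a child process, the check uses shutil.which to see whether an executable named tesseract is actually reachable through the PATH being set. It also sets LD_LIBRARY_PATH, which the Lambda snippet above does not do.

```python
import os
import shutil


def build_tesseract_env(bin_dirs, lib_dirs, tessdata_dir, base_env=None):
    """Return a copy of the environment with the extracted bundle added."""
    env = dict(os.environ if base_env is None else base_env)
    # Put the bundle directories first so they win over system defaults.
    env["PATH"] = os.pathsep.join(
        list(bin_dirs) + [env.get("PATH", "")]
    ).strip(os.pathsep)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(
        list(lib_dirs) + [env.get("LD_LIBRARY_PATH", "")]
    ).strip(os.pathsep)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env


def locate_tesseract(env):
    """Return the full path of an executable `tesseract` on env's PATH, or None."""
    return shutil.which("tesseract", path=env["PATH"])
```

For example, env = build_tesseract_env(["/tmp/pytesseract", "/tmp"], ["/tmp/lib", "/tmp"], "/tmp/tessdata/") followed by os.environ.update(env) applies the settings, and logging locate_tesseract(env) shows what would be found. A result of None means the file is either missing from those directories or is not marked executable.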

I am sure this package works, because I created a new EC2 instance, downloaded the same zip file and extracted the libraries into the /tmp/ folder. I then wrote a basic function to test tesseract, and it works.

import cv2
import pytesseract
import os

# os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['LD_LIBRARY_PATH'] = '/tmp/lib:/tmp'
config = ('-l eng --oem 1 --psm 3')
im = cv2.imread('pytesseract/test-european.jpg', cv2.IMREAD_COLOR)
text = pytesseract.image_to_string(im, config=config)
print(text)

Can somebody tell me what I did wrong in Lambda? I don't want to bundle everything into the deployment package, because my zip file is larger than 50 MB. I would also like to get the approach of downloading the packages/modules/binaries from S3 into Lambda working.

Random

2 Answers

Apparently Lambda doesn't allow you to change the PATH variable.

Random

Try adding this to your script:

pytesseract.pytesseract.tesseract_cmd = r'/var/task/tesseract'
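Note that /var/task is where Lambda unpacks the deployment package, so the line above assumes the binary is bundled with the function. If the binary is extracted to /tmp at runtime instead, as in the question, the path would need to point there. A hedged sketch that tries several locations is shown below; the helper names and any candidate path other than /var/task/tesseract are assumptions for illustration.

```python
import os


def pick_tesseract_cmd(candidates):
    """Return the first candidate that exists and is executable, else None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def configure_pytesseract(candidates):
    """Point pytesseract at the first usable binary; raise if none is found."""
    import pytesseract  # imported here so the lookup helper has no dependencies

    cmd = pick_tesseract_cmd(candidates)
    if cmd is None:
        raise FileNotFoundError(
            "no executable tesseract in: {}".format(list(candidates))
        )
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling configure_pytesseract(["/var/task/tesseract", "/tmp/tesseract"]) before the first OCR call would then use whichever copy is actually present, and fail with a clearer message if neither is.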