Tesseract OCR on AWS Lambda via virtualenv
I used this link (scroll to "Adaptations for tesseract 4") to build the tesseract executable and its dependency libraries. I zipped everything and uploaded it to S3.
I am using Lambda to download this zip and extract the dependencies into the /tmp folder. I now want to use these dependencies in my Lambda function (Python 3 runtime).
I am getting this error:
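For context, the download-and-extract step described above can be sketched roughly as follows. This is a minimal sketch, not my exact code: `download_zip` and `extract_with_modes` are hypothetical helper names, and the bucket and key are placeholders. One detail it makes explicit is that Python's `zipfile` module does not restore Unix permission bits on extraction, so the sketch reapplies them from the archive metadata to keep the binary executable.

```python
import os
import zipfile


def download_zip(bucket, key, dest_path):
    """Download the dependency zip from S3 (bucket and key are placeholders)."""
    # Imported lazily so the extraction helper below works without boto3.
    import boto3

    boto3.client("s3").download_file(bucket, key, dest_path)
    return dest_path


def extract_with_modes(zip_path, dest_dir):
    """Extract a zip and reapply the Unix permission bits stored in it.

    zipfile's extract()/extractall() write files with default permissions,
    so an executable in the archive loses its execute bit unless it is
    restored by hand.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            extracted = zf.extract(info, dest_dir)
            # The upper 16 bits of external_attr hold the Unix mode, if any.
            mode = (info.external_attr >> 16) & 0o777
            if mode and not info.is_dir():
                os.chmod(extracted, mode)
    return dest_dir
```

In a handler this would be called as `extract_with_modes(download_zip(bucket, key, "/tmp/deps.zip"), "/tmp")`, where `/tmp/deps.zip` is an assumed file name.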
Response:
{
"errorMessage": "tesseract is not installed or it's not in your path",
"errorType": "TesseractNotFoundError",
I think this is happening because the environment variable is not set. I have tried to set it but cannot get past this error.
import os
import sys
# Setting the modules path
sys.path.insert(0, '/tmp/')
import boto3
import cv2
import numpy as np
import subprocess
# Make the extracted tesseract binary and its language data discoverable
os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['TESSDATA_PREFIX'] = "/tmp/tessdata/"
import pytesseract
I have set the environment variables like this in the Lambda function, but I still get the same error. I have even tried setting the variables as shown in the image below, with no luck.
I am sure this Lambda package works because I created a new EC2 instance, downloaded the same zip file, and extracted the libraries into the /tmp/ folder. I wrote a basic test function for tesseract, and it works.
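To narrow down whether the binary is missing from PATH or present but not executable, a check along these lines could be run inside the handler before importing pytesseract. It is only a sketch: `diagnose_binary` is a hypothetical helper, and it assumes the executable is literally named `tesseract`.

```python
import os
import shutil


def diagnose_binary(name, path=None):
    """Report how `name` resolves on PATH and which candidates exist.

    Returns a dict with:
      resolved   - the path shutil.which() picks, or None
      candidates - (file_path, is_executable) for every PATH entry
                   that contains a regular file with that name
    """
    path = os.environ.get("PATH", "") if path is None else path
    resolved = shutil.which(name, path=path)
    candidates = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            candidates.append((candidate, os.access(candidate, os.X_OK)))
    return {"resolved": resolved, "candidates": candidates}
```

Calling `print(diagnose_binary("tesseract"))` in the handler would show whether a file was found but lacks the execute bit (a candidate with `False`) or was not found at all (an empty list). pytesseract also exposes `pytesseract.pytesseract.tesseract_cmd`, which can be set to the full path of the binary instead of relying on PATH.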
import cv2
import pytesseract
import os
# os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['LD_LIBRARY_PATH'] = '/tmp/lib:/tmp'
config = '-l eng --oem 1 --psm 3'
im = cv2.imread('pytesseract/test-european.jpg', cv2.IMREAD_COLOR)
text = pytesseract.image_to_string(im, config=config)
print(text)
Can somebody tell me what I did wrong with Lambda? I don't want to zip everything into the deployment package because my zip file is larger than 50 MB. I also want to download the packages/modules/binaries from S3 to Lambda and make that work.