I am running an OCR processing function in AWS Lambda using TesseractJS. I had to raise the Lambda function's memory to the maximum (1536 MB) to keep it from crashing due to memory issues. Even with this, the process comes close to the limit:

Duration: 54133.61 ms   Billed Duration: 54200 ms Memory Size: 1536 MB  Max Memory Used: 1220 MB

What puzzles me, and the reason I am posting this question, is why this takes so much memory. If I run the same process in my development environment, which has only 512 MB of memory, it completes without any problems.

The images I am using for these tests are only around 350 KB each.

Here is a snippet of my code:

Tesseract.recognize(img)
  .catch(err => reject(err))
  .then(function(result) {
    Tesseract.terminate();
    console.log(result);
  });

Here is a more complete version of my code:

lambda.js:

exports.handler = function(event, context, callback) {
    let snsMessage = getSNSMessageObject(
        JSON.stringify(event.Records[0].Sns.Message));
    let bucket = snsMessage.Records[0].s3.bucket.name;
    let key = snsMessage.Records[0].s3.object.key;

    let bookId = key.split('.')[0].split('/')[0];
    let pageNum = key.split('.')[0].split('/')[1];

    s3.getImage(bucket, key)
        .then(function(data) {
            return ocr.recognizeImage(data.Body);
        })
        .then(function(result) {
            return s3.uploadOCR(bucket, bookId, pageNum, result);
        })
        .then(fulfilled => callback(null))
        .catch(error => callback(error, 'Error'));

};

Helper functions:

    getImage: function getImage(bucket, key) {
        // Fetch the image from S3
        let params = {Bucket: bucket, Key: key};
        return s3.getObject(params).promise();
    },

    uploadOCR: function uploadOCR(bucket, bookId, pageNum, ocr) {
        // Upload the OCR JSON to S3
        let params = {
            Bucket: bucket,
            Key: (bookId + '/' + pageNum + '.json'),
            Body: ocr,
            ContentType: 'application/json'
        };

        return s3.putObject(params).promise();
    },

    recognizeImage: function recognizeImage(img) {
        return new Promise(function(resolve, reject) {
            // Process with TesseractJS
            Tesseract.recognize(img)
                .catch(err => reject(err))
                .then(function(result) {
                    Tesseract.terminate();

                    let ocr = {};
                    ocr['paragraphs'] = result.paragraphs.map(
                        p => ({'bbox': p.bbox, 'baseline': p.baseline,
                        'lines': p.lines.map(
                            l => ({'bbox': l.bbox, 'baseline': l.baseline,
                            'words': l.words.map(
                                w => ({'text': w.text, 'bbox': w.bbox,
                                'baseline': w.baseline}))
                            }))
                     }));
                    resolve(JSON.stringify(ocr));
                });
        });
    }
Andres
  • "*`.catch(err => reject(err))`*" - uh, avoid the [`Promise` constructor antipattern](https://stackoverflow.com/q/23803743/1048572?What-is-the-promise-construction-antipattern-and-how-to-avoid-it)! – Bergi Nov 24 '17 at 05:23
  • What is the error message that it fails with when you set the AWS memory to the same as your dev machine? – Bergi Nov 24 '17 at 05:24
  • @Bergi There is no error message. The function simply ends prematurely and no Tesseract progress logs (from `progress()` callback) are shown in CloudWatch. – Andres Nov 24 '17 at 15:59
  • @Bergi Can you elaborate on this Promise constructor antipattern? I am not using any deferreds, only native JavaScript promises. The example above is from the TesseractJS documentation: https://github.com/naptha/tesseract.js#tesseractjob . The reason I call `reject()` inside that catch is that the code snippet above is chained to other promise operations (get from S3, upload result to S3, etc.). – Andres Nov 24 '17 at 16:01
  • "*the code snippet above is chained to other promise operations*" - that's exactly what the promise constructor antipattern is about. If you only want to do chaining, you should not need to use the `new Promise` constructor. Can you post/link the full code? – Bergi Nov 24 '17 at 16:03
  • @Bergi Updated question with the full code. – Andres Nov 24 '17 at 16:09
  • Ah, I see. Tesseract not returning a real promise but rather a job object that is "*inspired by the ES6 Promise interface*" makes this a bit weird. It should be similar enough though to be able to assimilate it: `return Promise.resolve(Tesseract.recognize(img)).then(result => { Tesseract.terminate(); … return JSON.stringify(ocr); })` – Bergi Nov 24 '17 at 16:16
  • Thanks, I changed it and it works fine. But still not sure why the memory issue keeps happening... – Andres Nov 24 '17 at 16:39
  • Yeah, no idea on that either. Of course the machines are different, and IO is not deterministic, so maybe one is just too fast and tries to load everything into memory at once while the other is slower and uses the memory more sequentially… But that's just a quick guess, there are so many other possibilities. – Bergi Nov 24 '17 at 19:05
  • @Andres Have you had any luck figuring this out? I'm also working with tesseract and trying to get it running on AWS Lambda. You seem to be having more success than I am. – scotthorn0 Jan 23 '18 at 17:05
  • @scotthorn0 I wrote a [blog post](http://aalvarez.me/blog/posts/building-an-ocr-service-with-tesseractjs-in-aws-lambda.html) about it. You might find it helpful. Although I never fully understood why the memory issue was happening. – Andres Jan 23 '18 at 18:02
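
The refactoring suggested in the comments above can be sketched roughly as follows. This is a minimal sketch, not the asker's final code: `fakeTesseract` and `toOcr` are hypothetical names introduced here so the example runs without the library, and it assumes only what the comments state, namely that `Tesseract.recognize()` returns a then-able job object rather than a real promise.

```javascript
// Hypothetical stand-in for the Tesseract object so the sketch runs on its own.
// As assumed above, recognize() returns a thenable, not a real Promise.
const box = {x0: 0, y0: 0, x1: 10, y1: 10};
const fakeTesseract = {
    terminated: false,
    recognize(img) {
        const result = {
            paragraphs: [{
                bbox: box, baseline: box, confidence: 90,
                lines: [{
                    bbox: box, baseline: box,
                    words: [{text: 'hello', bbox: box, baseline: box, confidence: 91}]
                }]
            }]
        };
        return {then: (onFulfilled) => onFulfilled(result)};
    },
    terminate() { this.terminated = true; }
};

// Keep only the fields the question's code stores: bbox, baseline and text.
function toOcr(result) {
    return {
        paragraphs: result.paragraphs.map(p => ({
            bbox: p.bbox,
            baseline: p.baseline,
            lines: p.lines.map(l => ({
                bbox: l.bbox,
                baseline: l.baseline,
                words: l.words.map(w => ({text: w.text, bbox: w.bbox, baseline: w.baseline}))
            }))
        }))
    };
}

// Promise.resolve() assimilates the thenable, so the result can be chained
// and returned directly, without new Promise(...) or a manual reject().
function recognizeImage(img, tesseract = fakeTesseract) {
    return Promise.resolve(tesseract.recognize(img)).then(result => {
        tesseract.terminate();
        return JSON.stringify(toOcr(result));
    });
}

recognizeImage(Buffer.from('not a real image')).then(json => console.log(json));
```

Because `recognizeImage` returns an ordinary promise, it slots into the handler's existing `.then(...)` chain unchanged. This only tidies the control flow; it is not expected to change the memory usage discussed in the question.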

0 Answers