PDF corrupted when adding watermark in Amazon S3 Object Lambda function

Question

I have an S3 bucket that stores PDFs. We need to conditionally apply a watermark to a PDF, and we chose an S3 Object Lambda access point to do this. If we save the file back to S3 it works fine, but when we return it dynamically through write_get_object_response the file is corrupted. Here is the code.

import boto3
import json
import os
import logging
from io import BytesIO
from urllib import request
from urllib.parse import urlparse
import PyPDF4

logger = logging.getLogger('S3-img-processing')
logger.addHandler(logging.StreamHandler())
logger.setLevel(getattr(logging, os.getenv('LOG_LEVEL', 'INFO')))


def apply_watermark_to_pdf(inputStream, watermarkStream):
    
    watermark = PyPDF4.PdfFileReader(watermarkStream)
    watermark_page = watermark.getPage(0)
    output_stream = BytesIO()
    
    
    pdf = PyPDF4.PdfFileReader(inputStream)
    
    # Create a new PDF writer
    pdf_writer = PyPDF4.PdfFileWriter()
    
    # Iterate through each page of the input PDF
    for page_number in range(pdf.getNumPages()):
        page = pdf.getPage(page_number)
        
        # Merge the watermark page with the current page
        page.mergePage(watermark_page)
        
        # Add the modified page to the PDF writer
        pdf_writer.addPage(page)
    
    pdf_writer.write(output_stream)

    output_stream.seek(0)

    return output_stream



def handler(event, context) -> dict:
    logger.debug(json.dumps(event))
    object_context = event["getObjectContext"]

    # Get the presigned URL to fetch the requested original object from S3
    s3_url = object_context["inputS3Url"]
    watermark_url = r'Presignedurl of watermark pdf'
    

    # Extract the route and request token from the input context
    request_route = object_context["outputRoute"]
    request_token = object_context["outputToken"]

    # Get the original S3 object using the presigned URL
    inputReq = request.Request(s3_url)
    watermarkReq = request.Request(watermark_url)
    try:
        inputResponse = request.urlopen(inputReq)
        watermarkResponse = request.urlopen(watermarkReq)
    except request.HTTPError as e:
        logger.info(f'Error downloading the object. Error code: {e.code}')
        logger.exception(e.read())
        return {'status_code': e.code}

    # Apply watermark to the PDF
    transformed_object = apply_watermark_to_pdf(BytesIO(inputResponse.read()), BytesIO(watermarkResponse.read()))

    # Write object back to S3 Object Lambda
    s3 = boto3.client('s3')

    # The WriteGetObjectResponse API sends the transformed data
    if os.getenv('AWS_EXECUTION_ENV'):
        s3.write_get_object_response(
            Body=transformed_object,
            RequestRoute=request_route,
            RequestToken=request_token)

    return {'status_code': 200}

Here is the error message while downloading the pdf

How big are the resulting PDFs? The maximum payload of a Lambda response is 6 MB, so if the watermarked file is larger than that you can't send it directly via the Lambda response. — apokryfos, May 30 '23 at 07:34
Thank you for the response. They are very small, around 1-2 MB. — krishna, May 30 '23 at 08:14
There are many places where things could go wrong. I think this is currently too broad to debug and you'll likely not get many useful answers. Please isolate the problem and provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). — Maurice, May 30 '23 at 15:31

score 0 · Answer 1 · answered May 31 '23 at 09:27

What we were actually missing was the content type on the response. Once we added it, the PDFs could be viewed.
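A minimal sketch of that fix, under a few assumptions: the answer does not say which parameter was set, so this uses the `ContentType` argument that boto3's `write_get_object_response` accepts, with the standard `application/pdf` MIME type. The helper name `send_pdf_response` is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
from io import BytesIO


def send_pdf_response(s3_client, pdf_stream, request_route, request_token):
    """Send a transformed PDF back through S3 Object Lambda.

    Hypothetical helper: `s3_client` is expected to be a boto3 S3 client
    (for example boto3.client('s3')), and `pdf_stream` the BytesIO
    returned by apply_watermark_to_pdf.
    """
    kwargs = {
        # Raw bytes of the watermarked PDF
        "Body": pdf_stream.getvalue(),
        "RequestRoute": request_route,
        "RequestToken": request_token,
        # Assumed fix: declare the payload as a PDF so the client
        # receiving the object knows how to interpret it
        "ContentType": "application/pdf",
    }
    s3_client.write_get_object_response(**kwargs)
    return kwargs
```

In the handler from the question, this would stand in for the direct `s3.write_get_object_response(...)` call, for example `send_pdf_response(s3, transformed_object, request_route, request_token)`.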

krishna
