AWS lambda open pdf using PyPDF2

Question

I was trying to open a PDF with the Python library PyPDF2 in AWS Lambda, but it's giving me an access denied error.

Code

from PyPDF2 import PdfFileReader

pdf = PdfFileReader(open('S3 FILE URL', 'rb'))

if pdf.isEncrypted:
    pdf.decrypt('')

width = int(pdf.getPage(0).mediaBox.getWidth())
height = int(pdf.getPage(0).mediaBox.getHeight())

My bucket permissions

Block all public access: Off
Block public access to buckets and objects granted through new access control lists (ACLs): Off
Block public access to buckets and objects granted through any access control lists (ACLs): Off
Block public access to buckets and objects granted through new public bucket or access point policies: Off
Block public and cross-account access to buckets and objects through any public bucket or access point policies: Off

Maybe unrelated to PyPDF2? https://stackoverflow.com/questions/54690307/aws-lambda-returns-permission-denied-trying-to-getobject-from-s3-bucket Are there any other debugging attempts that you may have already done that could help understand the problem? — ssice, Sep 17 '21 at 13:27
Is there any way I can find the height and width of a PDF on Lambda without actually downloading it to my local machine? — Robin, Sep 18 '21 at 06:57
https://pypdf2.readthedocs.io/en/latest/user/streaming-data.html a reading example is missing, but you can simply put any bytestream into PdfReader — Martin Thoma, Dec 20 '22 at 18:41

score 1 · Answer 1 · answered Sep 17 '21 at 13:27

You're skipping a step by trying to use open() to fetch a URL: open() only works on files in the local filesystem, so it treats the URL string as a local path and never makes a network request - https://docs.python.org/3/library/functions.html#open
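To illustrate the point, here is a minimal sketch of what open() does with a URL string. The helper name try_open and the example.com URL are made up for this illustration, and the exact exception class can differ by operating system, so the sketch only catches the common base class OSError.

```python
import os
import tempfile


def try_open(path):
    """Return 'ok' if open() can read the path, else the exception class name."""
    try:
        with open(path, 'rb'):
            return 'ok'
    except OSError as exc:
        # FileNotFoundError and PermissionError are both subclasses of OSError
        return type(exc).__name__


# A real local file opens without trouble
with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
    tmp.write(b'%PDF-1.4')
local_result = try_open(tmp.name)
os.remove(tmp.name)

# A URL is just an odd-looking local path to open(); nothing is downloaded
url_result = try_open('https://example.com/some.pdf')  # hypothetical URL

print(local_result, url_result)
```

The second call fails before any network traffic happens, which is why the file has to be fetched separately first.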

You'll need to use urllib3/etc. to fetch the file from S3 first (assuming the bucket is also publicly-accessible, as Manish pointed out).

urllib3 usage suggestion: What's the best way to download file using urllib3

So combining the two:

pdf = PdfFileReader(open('S3 FILE URL', 'rb'))

becomes (something like)

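import urllib3
from PyPDF2 import PdfFileReader

def fetch_file(url, save_as, chunk_size=64 * 1024):
    # Stream the response to disk in chunks rather than loading it all at once
    http = urllib3.PoolManager()
    r = http.request('GET', url, preload_content=False)

    with open(save_as, 'wb') as out:
        while True:
            data = r.read(chunk_size)
            if not data:
                break
            out.write(data)

    r.release_conn()

if __name__ == "__main__":
    s3_file_url = 'S3 FILE URL'
    # On Lambda, /tmp is the only writable directory
    pdf_filename = "/tmp/my_pdf_from_s3.pdf"
    fetch_file(s3_file_url, pdf_filename)
    pdf = PdfFileReader(open(pdf_filename, 'rb'))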

Is there any way I can find the height and width of a PDF on Lambda without actually downloading it to my local machine? — Robin, Sep 18 '21 at 06:58
It (or with minor tweaks) should work on Lambda, yes. Unless the hosting site publishes metadata on the file, you'll need to download it (your machine or Lambda) — Adam Smooch, Sep 19 '21 at 00:14
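If the goal is to avoid writing a file at all, one possible variation is to read the object into memory inside the Lambda function and hand the byte stream to PyPDF2, as Martin Thoma's comment on the question suggests. The sketch below is an assumption-laden illustration, not a tested deployment: the helper name load_s3_object and the "bucket" and "key" event fields are invented here, the function's execution role is assumed to allow s3:GetObject on the object, and the PyPDF2 names match the question's code (PyPDF2 3.x replaced PdfFileReader with PdfReader).

```python
import io


def load_s3_object(s3_client, bucket, key):
    """Return the S3 object's bytes wrapped in a seekable in-memory stream."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # The response body is a one-way stream; PDF parsing needs to seek,
    # so the whole object is buffered in memory first
    return io.BytesIO(response["Body"].read())


def lambda_handler(event, context):
    # Imported here so load_s3_object stays usable without these packages.
    # PyPDF2 has to be bundled with the function or supplied as a layer.
    import boto3
    from PyPDF2 import PdfFileReader

    # "bucket" and "key" are hypothetical event fields chosen for this sketch
    stream = load_s3_object(boto3.client("s3"), event["bucket"], event["key"])

    pdf = PdfFileReader(stream)
    if pdf.isEncrypted:
        pdf.decrypt('')

    box = pdf.getPage(0).mediaBox
    return {"width": int(box.getWidth()), "height": int(box.getHeight())}
```

Because the client goes through the function's IAM role, this route does not depend on the bucket being publicly readable. The whole file is held in memory, so very large PDFs may need a higher memory setting or the /tmp approach instead.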

score 0 · Answer 2 · edited Sep 17 '21 at 13:25

I believe you have to make changes in this section of your S3 bucket in the AWS console. I believe this should solve your issue.

answered Sep 17 '21 at 13:07 by Manish Jain · edited Sep 17 '21 at 13:25 by ssice
