Read text from a PDF stored in S3 (Python)

Asked Nov 14 '17 at 00:17

Active Nov 14 '17 at 01:11

Viewed 1,881 times

I'm trying to read text from a PDF stored in S3. Is there a way to read the text directly from the stream, rather than creating a PDF locally and then converting it?

from boto3.session import Session

# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumed to be defined elsewhere
session = Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

s3 = session.resource('s3')

obj = s3.Object('my-bucket', 'file.pdf')

# .read() returns the raw PDF bytes, not decoded text
text = obj.get()['Body'].read()

print(text)

I've read that obj.get()['Body'] is a <botocore.response.StreamingBody object at 0x10d5a0fd0> and that calling .read() on it returns a binary string, but I'm not sure how to get the text from that.

I'm also new-ish to Python.

How do I read this as text so I can parse that text?

edited Nov 14 '17 at 01:11

asked Nov 14 '17 at 00:17

tim_xyz


How to read a file from S3 and how to convert a pdf to text are two distinct questions. Solve one at a time. – jordanm Nov 14 '17 at 01:04
I guess in this case I don't know where one question ends and the next begins, since I don't want to create a PDF locally before reading from it. I was hoping I could just read from the stream (maybe from a temporary file?) – tim_xyz Nov 14 '17 at 01:07
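
For what it's worth, the two steps discussed in the comments can probably be chained in memory without writing a file to disk. The following is only a rough sketch under some assumptions: it uses pypdf, a third-party library that is not mentioned in the question, and the helper names (body_to_stream, extract_pdf_text, read_pdf_text_from_s3) are made up for illustration. The body is copied into an io.BytesIO first because PDF parsers generally need to seek around the file, which a network stream may not support.

```python
import io


def body_to_stream(body):
    """Read a file-like body (e.g. obj.get()['Body']) fully into memory
    and return a seekable in-memory stream positioned at the start."""
    return io.BytesIO(body.read())


def extract_pdf_text(stream):
    """Extract text from a seekable binary stream containing a PDF.

    Assumes the third-party pypdf package is installed; it is imported
    here so the helper above can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(stream)
    # extract_text() can yield nothing for pages without a text layer
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def read_pdf_text_from_s3(bucket, key):
    """Hypothetical helper: fetch an S3 object and return its PDF text.

    Credentials are resolved by boto3's default chain here, rather than
    being passed explicitly as in the question.
    """
    import boto3

    obj = boto3.resource("s3").Object(bucket, key)
    return extract_pdf_text(body_to_stream(obj.get()["Body"]))
```

Usage would then look something like read_pdf_text_from_s3('my-bucket', 'file.pdf'). Note that this only recovers text from PDFs that actually contain a text layer; scanned documents would need OCR, which is a separate problem.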

0 Answers