I'm trying to read text from a PDF stored in S3. Is there a way to read the text directly from the stream, rather than saving the PDF locally and then converting it?
from boto3.session import Session

# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are defined elsewhere
session = Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)
s3 = session.resource('s3')
obj = s3.Object('my-bucket', 'file.pdf')
# obj.get()['Body'] is a StreamingBody; read() returns the raw PDF bytes
text = obj.get()['Body'].read()
print(text)
I've read that this returns a binary string (<botocore.response.StreamingBody object at 0x10d5a0fd0>), but I'm not sure how to get the text from that.
I'm also new-ish to Python.
How do I read this as text so that I can parse it?
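For reference, here is a rough sketch of the direction I think this would take, in case it helps frame the question. It assumes a third-party PDF library, pypdf, which is not part of my code above, and the helper names (`body_to_stream`, `extract_pdf_text`, `pdf_text_from_s3_object`) are made up for illustration. As far as I understand, the bytes from `read()` are the raw PDF file rather than encoded text, so decoding them would not work; they need to go through a PDF parser, and wrapping them in `io.BytesIO` gives the parser a seekable in-memory file so nothing is written to disk.

```python
import io


def body_to_stream(body):
    """Read a file-like body fully into memory and return a seekable stream.

    The S3 StreamingBody is not seekable, and PDF parsers generally need
    to seek, so the bytes are copied into an in-memory BytesIO buffer.
    """
    return io.BytesIO(body.read())


def extract_pdf_text(stream):
    """Extract text from a seekable PDF stream (assumes pypdf is installed)."""
    # Assumption: pypdf is the parser; it is imported here so that
    # body_to_stream can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(stream)
    # extract_text() can come back empty for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pdf_text_from_s3_object(obj):
    """Hypothetical helper: obj is an s3.Object like the one created above."""
    body = obj.get()["Body"]  # botocore StreamingBody
    return extract_pdf_text(body_to_stream(body))
```

With the `obj` from my code, the call would be `text = pdf_text_from_s3_object(obj)`. This holds the whole file in memory, which is probably fine for small PDFs but may not be for very large ones, and scanned PDFs without a text layer would presumably need OCR instead.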