Problem description: I have PDF files in an S3 bucket called "cases". I need to loop through all these PDFs, read each one, and sort them based on keywords. The PDFs that contain the specified keywords need to be stored in the "confirmed-covid19" bucket. The PDFs without the specified keywords need to be stored in the "no-covid" bucket.
Error: "ValueError: Filename must be a string."
Narrative: I ran the code in chunks to identify errors. The code shown below works above Line 37; the error comes from the code below Line 37. My understanding is that the function 'upload_file' only takes strings for the Filename and Key parameters. How can I fix this issue and put the PDFs containing the keywords in the "confirmed-covid19" bucket, and the rest in the "no-covid" bucket? I still want to keep the original name of each PDF file. What is the most efficient way to achieve this task? Also, all suggestions to improve the code are welcome.
import PyPDF2
import re
import os
import textract
import boto3
import glob
from PyPDF2 import PdfFileReader
from io import BytesIO
# Call boto3 to access AWS S3:
s3 = boto3.resource(
    service_name='s3',
    region_name='us-east-1',
    aws_access_key_id='MY_ACCESS_KEY_ID',
    aws_secret_access_key='MY_SECRET_ACCESS_KEY'
)
# Define S3 Bucket name:
bucket_name = s3.Bucket("cases")
# define keywords
search_words = ['Covid-19','Corona','virus'] # Look for these words in PDFs.
# Clients provide a low-level interface to AWS
s3_client = boto3.client('s3')
for filename in bucket_name.objects.all(): # Object summary iterator.
    body = filename.get()['Body'].read()
    f = PdfFileReader(BytesIO(body)) # Read the content of each file
    # Search for keywords
    for i in range(f.numPages):
        page = f.getPage(i) # get pages from pdf files
        text = page.extractText() # extract the text from each page
        search_text = text.lower().split()
        # ------------------------------ Line 37 -------------------------------- #
        for word in search_words: # look at each keyword
            if word in search_text: # find the keyword(s) in the text
                s3_client.upload_file(filename, 'confirmed-covid19', filename)
            else:
                s3_client.upload_file(filename, 'no-covid', filename)
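
For reference, here is a minimal sketch of one direction that might work. It is not a confirmed fix. It rests on two observations about the code above. First, the loop variable `filename` is an S3 object summary rather than a string; the object's name is in its `key` attribute. Second, the text is lowercased before the comparison while the keywords are not, so 'Covid-19' and 'Corona' can never match. The helper names `contains_keyword` and `route_pdf` are hypothetical. `client` is assumed to be a boto3 S3 client, and `copy_object` is used for a bucket-to-bucket copy in place of `upload_file`, which expects a path to a local file.

```python
def contains_keyword(text, keywords):
    """Return True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def route_pdf(client, source_bucket, key, text, keywords,
              match_bucket="confirmed-covid19", other_bucket="no-covid"):
    """Copy one object into the bucket chosen by the keyword check.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    The object keeps its original key (file name) in the target bucket.
    Returns the name of the bucket the object was copied to.
    """
    if contains_keyword(text, keywords):
        target = match_bucket
    else:
        target = other_bucket
    # Bucket-to-bucket copy: nothing is written to the local disk.
    client.copy_object(
        Bucket=target,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )
    return target
```

In the loop above, this would mean joining the text of all pages of a PDF into one string and then calling `route_pdf(s3_client, "cases", filename.key, full_text, search_words)` once per file, so that each PDF is copied exactly once instead of once per page and keyword. Note that the check is a plain substring match, so 'virus' would also match 'antivirus'.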