
Problem description: I have PDF files in an S3 bucket called "cases". I need to loop through these PDFs, read each one, and select PDFs based on keywords. The PDFs that contain the specified keywords need to be stored in the "confirmed-covid19" bucket; those without the specified keywords will be stored in the "no-covid" bucket.

Error: "ValueError: Filename must be a string."

Narrative: I ran the code in chunks to identify errors. The code shown below works above Line 37; the error comes from the code below Line 37. My understanding is that the upload_file function only takes strings for the Filename and Key parameters. How can I fix this issue and put the selected PDFs containing keywords in the "confirmed-covid19" bucket, and the rest in the "no-covid" bucket? I still want to keep the original name of each PDF file. What is the most efficient way to achieve this task? Also, all suggestions to improve the code are welcome.

import PyPDF2
import re
import os
import textract
import boto3
import glob
from PyPDF2 import PdfFileReader
from io import BytesIO

# Call boto3 to access AWS S3:
s3 = boto3.resource(
     service_name='s3',
     region_name='us-east-1',
     aws_access_key_id='MY_ACCESS_KEY_ID',
     aws_secret_access_key='MY_SECRET_ACCESS_KEY'
)

# Source bucket (note: this is a Bucket resource, not a name string):
bucket_name = s3.Bucket("cases")

# define keywords
search_words = ['Covid-19','Corona','virus'] # Look for these words in PDFs.

# Clients provide a low-level interface to AWS
s3_client = boto3.client('s3')

for filename in bucket_name.objects.all(): # Object summary iterator.
    body = filename.get()['Body'].read()
    f = PdfFileReader(BytesIO(body))       # Read the content of each file
    
    # Search for keywords
    for i in range(f.numPages):
        page = f.getPage(i)          # get pages from pdf files
        text = page.extractText()    # extract the text from each page
        search_text = text.lower().split()
        
# ------------------------------ Line 37 -------------------------------- #   

        for word in search_words:         # look at each keyword 
            if word in search_text:       # find the keyword(s) in the text
                s3_client.upload_file(filename, 'confirmed-covid19', filename)
            else:
                s3_client.upload_file(filename, 'no-covid', filename)
  • Your `filename` in `for filename in bucket_name.objects.all()` is not a string. It will be an instance of an [S3 Object](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object) class. – Marcin Nov 19 '20 at 03:23
  • This is a good observation. The 'filename' is an s3.ObjectSummary containing bucket_name and key. The key is a string containing the PDF file name. I think that in order to upload the PDF under its original name to an S3 bucket, I need to get the s3.ObjectSummary key, which is a string. I am trying to figure out how to do this. – Michael H Nov 19 '20 at 03:43
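
For reference, a minimal sketch of what the comments describe (bucket names as in the question; the example key name is hypothetical, and the PDF keyword search itself is elided):

import boto3

s3 = boto3.resource('s3')

for obj in s3.Bucket('cases').objects.all():   # obj is an s3.ObjectSummary
    key = obj.key                              # a plain string, e.g. 'case-001.pdf'
    # ... read and search the PDF body here ...
    dest = 'confirmed-covid19'                 # or 'no-covid', depending on the search
    # server-side copy to the destination bucket under the same key
    s3.meta.client.copy({'Bucket': 'cases', 'Key': key}, dest, key)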

2 Answers


You should try the pdfminer module. It extracts the text from the PDF and writes a txt file.
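A rough sketch of that idea with pdfminer.six (the bucket and key names are placeholders); extract_text also accepts a file-like object, so the text can be pulled straight from S3 without writing a local txt file:

from io import BytesIO

import boto3
from pdfminer.high_level import extract_text

s3 = boto3.client('s3')
# read the PDF from S3 into memory (hypothetical key name)
body = s3.get_object(Bucket='cases', Key='case-001.pdf')['Body'].read()
text = extract_text(BytesIO(body))   # no local file needed
print(text.lower())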

  • Thank you for your suggestion. I forgot to mention I would like to keep the PDF file format. I am sure this module is useful in some cases, but I am sorry, sir, I don't see how it can help me upload the selected PDFs to an S3 bucket based on keywords. Maybe I am not fully understanding your suggestion. – Michael H Nov 19 '20 at 02:47
  • @justinlachap Can I use pdfminer to extract text from a PDF that is stored in S3? I don't want to download the file locally; I want to extract directly from the PDF stored in S3. – d_b Jan 22 '21 at 11:05

@Michael

Read the boto3 docs:

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html

upload_file uploads a local file to S3, and you don't have a local file. You can either create the file locally and upload it, or copy the object from one bucket to the other.

This is solved here:

how to copy s3 object from one bucket to another using python boto3
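
Applied to this question, an untested sketch of that copy approach; it reuses the PyPDF2 calls from the question, lower-cases the keywords so they can actually match the lower-cased text, and decides the destination once per file instead of once per keyword:

import boto3
from io import BytesIO
from PyPDF2 import PdfFileReader

search_words = {'covid-19', 'corona', 'virus'}   # lower-cased to match text.lower()

s3 = boto3.resource('s3')

for obj in s3.Bucket('cases').objects.all():     # obj is an s3.ObjectSummary
    reader = PdfFileReader(BytesIO(obj.get()['Body'].read()))
    words = set()
    for i in range(reader.numPages):             # collect the words of every page
        words.update(reader.getPage(i).extractText().lower().split())
    dest = 'confirmed-covid19' if words & search_words else 'no-covid'
    # server-side copy keeps the original file name (obj.key is a string)
    s3.Bucket(dest).copy({'Bucket': 'cases', 'Key': obj.key}, obj.key)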

Best Regards.