I am new to Python and coding in general. I'm trying to write a program that OCRs a directory of PDFs and then extracts their text so I can later pick out specific details. However, I am having trouble getting pdfplumber to extract the text from every page. You can index pages from a start to an end, but if the end is unknown, it breaks because the index is out of range.
import ocrmypdf
import os
import requests
import pdfplumber
import re
import logging
import sys
import PyPDF2

## test folder C:\Users\adams\OneDrive\Desktop\PDF
user_direc = input("Enter the path of your files: ")

# Walk the path, print each PDF found, and OCR the documents,
# skipping any pages that already have text.
for dir_name, subdirs, file_list in os.walk(user_direc):
    logging.info(dir_name + '\n')
    os.chdir(dir_name)
    for filename in file_list:
        file_ext = os.path.splitext(filename)[-1]
        if file_ext == '.pdf':
            full_path = dir_name + '/' + filename
            print(full_path)
            result = ocrmypdf.ocr(filename, filename, skip_text=True, deskew=True, optimize=1)
            logging.info(result)

# The next step is to extract the text from each individual document and print it.
directory = os.fsencode(user_direc)
for file in os.listdir(directory):
    filename = os.fsdecode(file)  # use the decoded name, not the bytes entry
    if filename.endswith('.pdf'):
        with pdfplumber.open(filename) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)
As is, this only takes the text from the first page of each PDF. I want to extract all of the text from each PDF, but pdfplumber breaks if my index is too large, and I don't know in advance how many pages a PDF will have. I've tried
page = pdf.pages[0:-1]
but this breaks as well. I have not been able to find a workaround with PyPDF2, either. I apologize if this code is sloppy or unreadable. I've tried to add comments to explain what I am doing.
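Since pdf.pages seems to be a plain list, my guess is that I should loop over it rather than index into it. This is an untested sketch of what I'm aiming for (all_text and page_text are just placeholder names I made up):

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(filename) as pdf:
            all_text = ''
            # iterate over every page instead of indexing to a fixed position
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:  # extract_text() may return None for empty pages
                    all_text += page_text + '\n'
            print(all_text)

If iterating isn't the right approach, I assume I could instead get the page count first with len(pdf.pages) and loop by index, but I don't know which way is more idiomatic.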