I want to extract text from a PDF file with the Python library pdfreader. I followed the instructions here:
https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-browse-document-pages
This is my code:
import requests
from io import StringIO, BytesIO
from pdfreader import SimplePDFViewer, PDFDocument

pdf_links = ['https://www.buelach.ch/fileadmin/files/documents/Finanzen/Finanz-_und_Aufgabenplan_2020-2024_2020-09-14.pdf',
             'https://www.buelach.ch/fileadmin/files/documents/Finanzen/201214_budget2021_aenderungen_gr.pdf',
             'http://www.dielsdorf.ch/dl.php/de/5e8c284c3b694/2020.04.06.pdf',
             'http://www.dielsdorf.ch/dl.php/de/5f17e472ca9f1/2020.07.20.pdf']

for pdf_link in pdf_links:
    response = requests.get(pdf_link)
    my_raw_data = response.content
    # extract text page by page
    with BytesIO(my_raw_data) as data:
        viewer = SimplePDFViewer(data)
        full_pdf_text = ''
        total_page_num = len(list(viewer))
        for i, page in enumerate(viewer):
            text = page.strings
            text = "".join(text)
            text = text.strip().replace(' ', '\n\n').strip()
            text = text.replace(' ', '\n\n')
            print('PAGE', i)
The code does not raise any errors, but it does not iterate over the pages.
The variable total_page_num gives me the number of pages (more than 1), but the for loop only ever runs once, for a single page (only the first page).
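One thing that might be worth ruling out, sketched here without pdfreader itself: if the viewer remembers its current page between iterations, then the len(list(viewer)) call would already walk through every page, and the later for loop would start from wherever that walk stopped. The CursorViewer class below is a hypothetical stand-in written only for illustration; it is not pdfreader's API, and whether SimplePDFViewer really behaves this way is an assumption.

```python
class CursorViewer:
    """Hypothetical stand-in for a viewer that remembers its current page."""

    def __init__(self, pages):
        self.pages = pages
        self.current = 0  # index of the page the viewer is currently on

    def __iter__(self):
        # Iteration resumes from the current page instead of restarting.
        while True:
            yield self.pages[self.current]
            if self.current + 1 >= len(self.pages):
                break
            self.current += 1


viewer = CursorViewer(["page 1", "page 2", "page 3"])
total = len(list(viewer))    # counting the pages walks the cursor forward
second_pass = list(viewer)   # a second loop starts where the cursor was left

print(total)
print(second_pass)
```

In this sketch the second pass sees only one page, which matches the symptom. If that is the cause, counting pages inside the same loop (or resetting the viewer to its first page before looping again) would avoid the problem.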