
I want to extract text from PDF files with the Python library pdfreader. I followed the tutorial here:

https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-browse-document-pages

This is my code:

import requests
from io import StringIO, BytesIO
from pdfreader import SimplePDFViewer, PDFDocument

pdf_links = ['https://www.buelach.ch/fileadmin/files/documents/Finanzen/Finanz-_und_Aufgabenplan_2020-2024_2020-09-14.pdf',
             'https://www.buelach.ch/fileadmin/files/documents/Finanzen/201214_budget2021_aenderungen_gr.pdf',
             'http://www.dielsdorf.ch/dl.php/de/5e8c284c3b694/2020.04.06.pdf',
             'http://www.dielsdorf.ch/dl.php/de/5f17e472ca9f1/2020.07.20.pdf']

for pdf_link in pdf_links:

    response = requests.get(pdf_link)
    my_raw_data = response.content


    #extract text page by page
    with BytesIO(my_raw_data) as data:
        
        viewer = SimplePDFViewer(data)
        full_pdf_text = ''

        total_page_num = len(list(viewer))
        for i, page in enumerate(viewer):
            text = page.strings
            text = "".join(text)
            text = text.strip().replace('     ', '\n\n').strip()
            text = text.replace('  ', '\n\n')
            print('PAGE', i)

The code runs without errors, but it does not iterate over the pages. The variable total_page_num holds the correct number of pages (more than 1), yet the for loop body runs only once, for the first page.

taga

1 Answer


Solving this issue required a lot of documentation reading for the Python module pdfreader. I was shocked at the level of difficulty in using this module for simple text extraction. It took hours to figure out a working solution.

The code below will enumerate the text on individual pages. You will still need to do some text cleaning to get your desired output.

I noticed that one of your PDFs has a font-encoding problem during parsing, which produces a warning message.

import requests
from io import BytesIO
from pdfreader import SimplePDFViewer

pdf_links = [
    'https://www.buelach.ch/fileadmin/files/documents/Finanzen/Finanz-_und_Aufgabenplan_2020-2024_2020-09-14.pdf',
    'https://www.buelach.ch/fileadmin/files/documents/Finanzen/201214_budget2021_aenderungen_gr.pdf',
    'http://www.dielsdorf.ch/dl.php/de/5e8c284c3b694/2020.04.06.pdf',
    'http://www.dielsdorf.ch/dl.php/de/5f17e472ca9f1/2020.07.20.pdf']

for pdf_link in pdf_links:

    # stream=True is unnecessary here because response.content reads the whole body anyway
    response = requests.get(pdf_link)

    # extract text page by page
    with BytesIO(response.content) as data:

        viewer = SimplePDFViewer(data)

        # Count the pages once, then visit each page by its 1-based number.
        number_of_pages = len(list(viewer.doc.pages()))
        for page_number in range(1, number_of_pages + 1):
            viewer.navigate(page_number)
            # render() must be called before viewer.canvas is read for the current page
            viewer.render()
            page_strings = " ".join(viewer.canvas.strings).replace('     ', '\n\n').strip()
            print(f'Current Page Number: {page_number}')
            print(f'Page Text: {page_strings}')
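
For the text cleaning mentioned above, here is a minimal sketch of a helper that could replace the chained replace() call. It does not depend on pdfreader. The name clean_page_text and the gap parameter are made up for this illustration, and treating a run of five or more spaces as a paragraph break is only an assumption carried over from the replace() call in the code above, so adjust it to your PDFs.

```python
import re


def clean_page_text(strings, gap=5):
    """Join the string fragments of one page and normalise whitespace.

    Assumption: a run of `gap` or more spaces marks a paragraph break.
    Shorter runs of spaces or tabs collapse to a single space.
    """
    text = " ".join(strings)
    # Turn long runs of spaces into paragraph breaks first.
    text = re.sub(r" {%d,}" % gap, "\n\n", text)
    # Collapse the remaining runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Drop stray spaces around line breaks, then allow at most one blank line.
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Inside the loop you would then write page_strings = clean_page_text(viewer.canvas.strings) and print it as before.
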
Life is complex
  • I have started getting this error: https://github.com/maxpmaxp/pdfreader/issues/77. Do you know what the problem could be? The error is: No such file or directory: '/usr/local/lib/python3.7/dist-packages/pdfreader/codecs/cmaps/Identity-H' – taga Mar 16 '21 at 09:11
  • I want to deploy this on AWS Lambda. – taga Mar 16 '21 at 09:13
  • Deploying this on AWS Lambda would be interesting. How do you plan to scrape (walk) the URLs for all your sources? Where do you plan to put the output of the extraction? – Life is complex Mar 17 '21 at 03:24