0

I am using pyPDF4 to read a pdf File. The file has text like:

Abrechnung30.11.2022 0,00+ Kontostand/Rechnungsabschlussam30.11.2022 672,06H Rechnungsnummer:2022-11-3020:53:31.468209 01.12.2022 01.12.2022 Barausz.Debit.KFK

What I am trying to do is: 1.Read the pdf file 2. Find the line number where the string "Rechnungsnummer" appears and then I want to go to the next line and the line "Barausz." in order to extract the date and the category.

What I managed so far:

import PyPDF4
import re


with open('../../Desktop/Konto_202212.pdf', 'rb') as pdfFile:
    reader = PyPDF4.PdfFileReader(pdfFile)
    page1 = reader.getPage(1)
    text = page1.extractText()

    a=text.find('Rechnungsnummer')
    print(a)

But this returns me only the char index? But how to find the line number? So in the end text is a big string with a lot of "\n"

Or do you have another method?

Thank you very much for your help!

Kevin

Kevin
  • 1
  • 1

1 Answers1

0

Try PyMuPDF instead:

import fitz  # package PyMuPDF

with fitz.open(filename) as doc:
    for page in doc:
        line_no = 0
        alltext = page.get_get("dict", flags=fitz.TEXTFLAGS_TEXT)
        for block in alltext["blocks"]:
            for line in block["lines"]:
                line_no += 1
                text = "".join([span["text"] for span in line["spans"]])
                if "Rechnungsnummer" in text:
                    print(f"Found 'Rechnungsnummer' in line {line_no} on page {page.number}.")

It seems though that you would like to actually find the invoice number. For this, a useful variant of page.get_text() is more promising:

import fitz  # package PyMuPDF

with fitz.open(filename) as doc:
    for page in doc:
        words = page.get_text("words", sort=True)
        for i, word in enumerate(words):
            if word[4] == "Rechnungsnummer:":
                number = words[i+1][4]
                print(f"Rechnungsnummer: {number} auf Seite {page.number}.")
                break

What we are doing here is extracting all strings containing no spaces, sort them vertically, then horizontally.

Jorj McKie
  • 2,062
  • 1
  • 13
  • 17