How to find the line number where a string appears using pyPDF?

Question

I am using PyPDF4 to read a PDF file. The file has text like:

Abrechnung30.11.2022 0,00+ Kontostand/Rechnungsabschlussam30.11.2022 672,06H Rechnungsnummer:2022-11-3020:53:31.468209 01.12.2022 01.12.2022 Barausz.Debit.KFK

What I am trying to do is: 1. Read the PDF file. 2. Find the line number where the string "Rechnungsnummer" appears, then go to the next line and to the line containing "Barausz." in order to extract the date and the category.

What I managed so far:

import PyPDF4
import re


with open('../../Desktop/Konto_202212.pdf', 'rb') as pdfFile:
    reader = PyPDF4.PdfFileReader(pdfFile)
    page1 = reader.getPage(1)
    text = page1.extractText()

    a=text.find('Rechnungsnummer')
    print(a)

But this only returns the character index. How can I find the line number? In the end, text is one big string containing many "\n" characters.

Or do you have another method?

Thank you very much for your help!

Kevin

Text extraction is not very reliable. I would try to get, for example, the full text block with `re.search(r"Abrechnung.+Barausz.Debit.KFK", text, re.S)` and then process it — cards, Jan 29 '23 at 20:53
I recommend using `pypdf` (I'm the maintainer of pypdf and PyPDF2) — Martin Thoma, Feb 11 '23 at 15:15
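
If you stay with the extracted string, the character index from `text.find` can be turned into a line number by counting the newlines in front of it. The following is only a minimal sketch with plain string methods: the helper name `line_number_of` is hypothetical, and the line breaks in `sample` are an assumption, since the real layout depends on how the PDF library splits the text.

```python
def line_number_of(text: str, needle: str) -> int:
    """Return the 1-based line number of the first match, or -1 if absent."""
    idx = text.find(needle)
    if idx == -1:
        return -1
    # Every "\n" before the match terminates one earlier line.
    return text.count("\n", 0, idx) + 1


# Hypothetical sample; the actual line breaks depend on the PDF.
sample = (
    "Abrechnung30.11.2022 0,00+\n"
    "Kontostand/Rechnungsabschlussam30.11.2022 672,06H\n"
    "Rechnungsnummer:2022-11-3020:53:31.468209\n"
    "01.12.2022 01.12.2022\n"
    "Barausz.Debit.KFK\n"
)

lines = sample.splitlines()
n = line_number_of(sample, "Rechnungsnummer")
print(n)

# n is 1-based and the list is 0-based, so lines[n] is the line after the match.
if 0 < n < len(lines):
    print(lines[n])
```

Because `lines` is 0-based while `n` is 1-based, `lines[n]` already points at the following line; the line after that would be `lines[n + 1]`, provided it exists.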

Jorj McKie · Answer 1 · 2023-01-29T20:45:54.020

Try PyMuPDF instead:

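import fitz  # package PyMuPDF

filename = "../../Desktop/Konto_202212.pdf"  # path taken from the question

with fitz.open(filename) as doc:
    for page in doc:
        line_no = 0  # line counter, restarted on every page
        # "dict" output: blocks -> lines -> spans; TEXTFLAGS_TEXT skips images
        alltext = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)
        for block in alltext["blocks"]:
            for line in block["lines"]:
                line_no += 1
                # a line can consist of several spans; join them into one string
                text = "".join([span["text"] for span in line["spans"]])
                if "Rechnungsnummer" in text:
                    # page.number is 0-based
                    print(f"Found 'Rechnungsnummer' in line {line_no} on page {page.number}.")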

It seems, though, that you actually want to find the invoice number. For this, the "words" variant of page.get_text() is more promising:

import fitz  # package PyMuPDF

with fitz.open(filename) as doc:
    for page in doc:
        words = page.get_text("words", sort=True)
        for i, word in enumerate(words):
            if word[4] == "Rechnungsnummer:":
                number = words[i+1][4]
                print(f"Rechnungsnummer: {number} auf Seite {page.number}.")
                break

What we are doing here is extracting all strings that contain no spaces and sorting them vertically, then horizontally.
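
To illustrate the lookup without opening a PDF, here is a small sketch on hand-made word tuples. Each item returned by `page.get_text("words")` is a tuple `(x0, y0, x1, y1, word, block_no, line_no, word_no)`; the coordinates below are made up, the helper name `value_after_label` is hypothetical, and the sort key only approximates the vertical-then-horizontal ordering. The sketch also covers the case, suggested by the sample text in the question, where the label and the number are fused into one word without a space.

```python
from typing import Optional, Sequence, Tuple

Word = Tuple[float, float, float, float, str, int, int, int]

LABEL = "Rechnungsnummer:"


def value_after_label(words: Sequence[Word], label: str = LABEL) -> Optional[str]:
    """Return the text that follows the label, or None if it is not found."""
    # Order top-to-bottom (bottom edge y1), then left-to-right (x0).
    ordered = sorted(words, key=lambda w: (round(w[3]), w[0]))
    for i, w in enumerate(ordered):
        token = w[4]
        if token == label:
            # Label is its own word: take the next word, if there is one.
            return ordered[i + 1][4] if i + 1 < len(ordered) else None
        if token.startswith(label):
            # Label and value were extracted as a single word.
            return token[len(label):]
    return None


# Made-up tuples, deliberately out of reading order.
words = [
    (300.0, 100.0, 380.0, 112.0, "2022-11-30", 1, 0, 1),
    (200.0, 100.0, 290.0, 112.0, "Rechnungsnummer:", 1, 0, 0),
    (50.0, 80.0, 120.0, 92.0, "Kontostand", 0, 0, 0),
]

print(value_after_label(words))
```

With a real document you would pass `page.get_text("words")` in place of the hand-made list.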
