I am using PyPDF4 to read a PDF file. The file has text like:
Abrechnung30.11.2022 0,00+ Kontostand/Rechnungsabschlussam30.11.2022 672,06H Rechnungsnummer:2022-11-3020:53:31.468209 01.12.2022 01.12.2022 Barausz.Debit.KFK
What I am trying to do is: 1. Read the PDF file. 2. Find the line number where the string "Rechnungsnummer" appears, then go on to the following line(s), up to the line containing "Barausz.", in order to extract the date and the category.
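What I have managed so far:
import PyPDF4
import re

with open('../../Desktop/Konto_202212.pdf', 'rb') as pdfFile:
    reader = PyPDF4.PdfFileReader(pdfFile)
    page1 = reader.getPage(1)  # page indices are zero-based, so this is the second page
    text = page1.extractText()

a = text.find('Rechnungsnummer')  # character offset of the first match, or -1 if absent
print(a)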
But this only returns the character index. How do I find the line number? In the end, text is one big string containing a lot of "\n" characters.
Or do you have another method?
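One possible direction, as a minimal sketch rather than a tested solution: since text is a single string, the line number of a match should equal the number of "\n" characters before its character index. The sample string below is a stand-in for the output of extractText(), and where its line breaks fall is an assumption, as are the helper name line_number_of and the regular expression for the entry line.

```python
import re

# Hypothetical stand-in for page1.extractText(); where the "\n" characters
# fall is an assumption about how the PDF text is really split.
text = (
    "Abrechnung30.11.2022 0,00+\n"
    "Kontostand/Rechnungsabschlussam30.11.2022 672,06H\n"
    "Rechnungsnummer:2022-11-3020:53:31.468209\n"
    "01.12.2022 01.12.2022 Barausz.Debit.KFK\n"
)


def line_number_of(text, needle):
    """Return the zero-based line index of the first match, or -1."""
    pos = text.find(needle)
    if pos == -1:
        return -1
    # Each newline before the match means one more line above it.
    return text.count("\n", 0, pos)


# Split on "\n" only, so the indices agree with the newline count above.
lines = text.split("\n")
idx = line_number_of(text, "Rechnungsnummer")

# Assumed layout of the entry line: two dd.mm.yyyy dates, then the category.
entry = re.compile(r"(\d{2}\.\d{2}\.\d{4})\s+(\d{2}\.\d{2}\.\d{4})\s+(\S+)")

date, category = None, None
if idx != -1 and idx + 1 < len(lines):
    match = entry.match(lines[idx + 1])
    if match:
        date, category = match.group(1), match.group(3)

print(idx, date, category)
```

If the real text puts the "Barausz." entry more than one line after "Rechnungsnummer", the same idea would apply by scanning lines[idx + 1:] for the first line that contains "Barausz.".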
Thank you very much for your help!
Kevin