I am working on an information retrieval task in Python: extracting invoice numbers from PDF files. I have converted the PDFs to strings (keeping the original layout). Some of the files contain multiple invoice numbers in a table. Below is an example from one PDF invoice:
Invoice Number Date Ac.No. Type Amount
1654339087 28.01.2019 1508765556 Invoice 1,268.40
1655214567 18.12.2018 3508753550 Invoice 3,134.20
1609833445 12.02.2019 2500444556 Invoice 2,735.84
To extract the invoice numbers from these tables I wrote a regex. To capture invoice numbers from multiple lines I repeat the last part of the regex; for the example above, I repeat the last part (.+\n(\d{5,})) 3 times. This works, but I don't know in advance how many such lines a PDF file will contain (it could be 10 or 20), and I would have to repeat that part of the regex once per line. I am looking for an efficient solution where I can specify a number (equal to the total number of lines) in the regex, or multiply that part of the regex by some number to repeat it.
For example, something like this: (.+\n(\d{5,})*10) or (.+\n(\d{5,}){10}). I found a few similar answers (not exactly the same) that mention using {} to pass the number, but this doesn't work in my case. Below is the regex I have created:
import re
pattern = re.compile(r'Invoice Number\s*[A-Za-z0-9-._:\s]+\n(\d{5,}).+\n(\d{5,}).+\n(\d{5,})', re.IGNORECASE | re.MULTILINE)
The expected output, which I currently get by repeating that part of the regex 3 times, is:
1654339087
1655214567
1609833445
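For comparison, here is a minimal sketch of what a {} quantifier does on the sample table. The pattern is a simplified, hypothetical variant of the one above (the row part is wrapped in a non-capturing group and repeated with {3}), not the exact pattern in use. In Python's re module a capturing group that sits inside a repeated group keeps only its last match, which is probably why the {} approach appears not to work:

```python
import re

# Sample table from the question, as a single string.
text = (
    "Invoice Number Date Ac.No. Type Amount\n"
    "1654339087 28.01.2019 1508765556 Invoice 1,268.40\n"
    "1655214567 18.12.2018 3508753550 Invoice 3,134.20\n"
    "1609833445 12.02.2019 2500444556 Invoice 2,735.84\n"
)

# Hypothetical simplified pattern: header line, then the row part repeated
# exactly three times with a {3} quantifier.
quantified = re.compile(r'Invoice Number.*\n(?:(\d{5,}).+\n?){3}', re.IGNORECASE)

m = quantified.search(text)
# The whole three-row block matches, but the single capturing group only
# remembers its final iteration.
print(m.groups())  # ('1609833445',)
```

So the quantifier does repeat the row part; it just cannot return one captured value per repetition.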
Any help here is appreciated!!
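One possible direction that avoids counting lines, sketched under assumptions rather than offered as a definitive answer: locate the header line, take the block of consecutive rows that follow it, and collect the leading number of each row with findall. The function name extract_invoice_numbers and the three helper patterns are hypothetical, and the sketch assumes every table row starts with the invoice number (5 or more digits) at the beginning of a line, directly under a line that starts with "Invoice Number":

```python
import re

# Assumption: the table header is a line that starts with "Invoice Number".
HEADER = re.compile(r'^Invoice Number\b.*\n', re.IGNORECASE | re.MULTILINE)
# Assumption: a row is a line starting with 5+ digits; match a run of such lines.
ROWS = re.compile(r'(?:^\d{5,}\b.*\n?)+', re.MULTILINE)
# The leading number of each row inside a matched block.
ROW_NUMBER = re.compile(r'^(\d{5,})\b', re.MULTILINE)


def extract_invoice_numbers(text):
    """Return the first-column numbers of every table found in text."""
    numbers = []
    for header in HEADER.finditer(text):
        # Try to match the row block starting right after the header line.
        block = ROWS.match(text, header.end())
        if block:
            numbers.extend(ROW_NUMBER.findall(block.group()))
    return numbers


sample = (
    "Invoice Number Date Ac.No. Type Amount\n"
    "1654339087 28.01.2019 1508765556 Invoice 1,268.40\n"
    "1655214567 18.12.2018 3508753550 Invoice 3,134.20\n"
    "1609833445 12.02.2019 2500444556 Invoice 2,735.84\n"
)
print(extract_invoice_numbers(sample))
# ['1654339087', '1655214567', '1609833445']
```

Because the row block is matched with + rather than a fixed count, the same code handles 3, 10, or 20 rows, and it stops at the first line that does not start with a number.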