
I am working on an information retrieval task in Python: extracting invoice numbers from PDF files. I have converted the PDFs to strings (keeping the original layout). Some PDF files contain multiple invoice numbers in a table. Below is an example from one PDF invoice:

Invoice Number Date         Ac.No.      Type         Amount

1654339087      28.01.2019  1508765556  Invoice      1,268.40
1655214567      18.12.2018  3508753550  Invoice      3,134.20
1609833445      12.02.2019  2500444556  Invoice      2,735.84

To extract the invoice numbers from such tables I have written a regex. To capture the invoice number from multiple lines, I repeat the last part of the regex; in the scenario above, I repeat the last part (.+\n(\d{5,})) 3 times. This works fine, but the problem is that I don't know how many such lines the PDF file will contain; there could be 10 or 20. In that case I would need to repeat this part of the regex once per line. I am looking for an efficient solution where I can specify a number (equal to the total number of lines) in the regex, or multiply the regex by some number to repeat it.

For example, something like (.+\n(\d{5,})*10) or (.+\n(\d{5,}){10}). I found a few similar (not identical) answers that mention using {} to pass the number, but this doesn't work in my case. Below is the regex I have created:

pattern = re.compile(r'Invoice Number\s*[A-Za-z0-9-._:\s]+\n(\d{5,}).+\n(\d{5,}).+\n(\d{5,})',re.IGNORECASE | re.MULTILINE)

And the expected output (which I currently get by repeating part of the regex 3 times) is:

1654339087
1655214567
1609833445

Any help here is appreciated!!
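
A note on why the {} attempt appears not to work: in the standard re module, a capturing group that sits inside a repeated group keeps only the text from its last iteration. The sketch below illustrates this on the sample table; the names text, repeated and last_only are made up for the illustration, and the pattern is a simplified stand-in for the one above.

```python
import re

# Sample table from the question, including the blank line after the header.
text = (
    "Invoice Number Date         Ac.No.      Type         Amount\n"
    "\n"
    "1654339087      28.01.2019  1508765556  Invoice      1,268.40\n"
    "1655214567      18.12.2018  3508753550  Invoice      3,134.20\n"
    "1609833445      12.02.2019  2500444556  Invoice      2,735.84\n"
)

# A single capturing group, repeated three times with {3}.
repeated = re.compile(
    r"Invoice Number[^\n]*\n\s*(?:(\d{5,})[^\n]*\n){3}",
    re.IGNORECASE,
)

match = repeated.search(text)

# The pattern matches all three rows, but the group only remembers
# what it captured on its final repetition.
last_only = match.groups()
print(last_only)
```

So the quantifier does repeat the sub-pattern; it is the earlier captures that are overwritten.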

ManojK
  • If you need it badly, use the PyPI regex module, see [this answer](https://stackoverflow.com/a/9765390/3832970). Otherwise, capture the part you need as a whole and then apply another, simpler regex search to the extracted chunk only. – Wiktor Stribiżew Jul 01 '19 at 08:12
  • Could there be gaps between matched lines, considering the 10-line constraint? Say 3 lines matched, then the next 5 lines not matched, then 7 lines matched? – RomanPerekhrest Jul 01 '19 at 08:26
  • @RomanPerekhrest - No, there would not be a gap. The table will have one header (Invoice Number) and then multiple lines, each containing an invoice number. So I think that if the first line after the header matches, the rest of the lines will match too. – ManojK Jul 01 '19 at 10:46
  • Great, you may post it as an answer, BTW. – Wiktor Stribiżew Jul 01 '19 at 10:53
  • @Wiktor Stribiżew - Thanks for the comment. The question behind the link you provided is slightly different, but it led me to look into the Python regex module. What worked for me is your answer at https://stackoverflow.com/questions/46603805/capture-repeated-groups-in-python-regex . Using that solution I have created a new working regex for now. It is not exactly what I wanted, but it is much better than my previous solution. The regex is: `pattern = r'(?:^(?=.*Invoice Number)|\G(?!^)).*?\s*[A-Za-z0-9-._:\s]+\n(\d{5,})'` – ManojK Jul 01 '19 at 11:01
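
The two-step approach suggested in the first comment above (capture the table as a whole, then run a simpler search on that chunk) can be sketched with the standard re module alone. This is only a sketch under assumptions: the function name extract_invoice_numbers and the sample string are hypothetical, and the table is assumed to end at the first line that is neither blank nor starts with at least five digits.

```python
import re

def extract_invoice_numbers(text):
    """Return the leading number of each row below an 'Invoice Number' header."""
    # Step 1: grab the header line plus every following line that is
    # blank or starts with a run of at least 5 digits.
    block = re.search(
        r"^.*Invoice Number.*\n((?:[ \t]*\n|[ \t]*\d{5,}.*\n?)+)",
        text,
        re.IGNORECASE | re.MULTILINE,
    )
    if block is None:
        return []
    # Step 2: a much simpler pattern on the extracted chunk only.
    return re.findall(r"^[ \t]*(\d{5,})", block.group(1), re.MULTILINE)

sample = (
    "Invoice Number Date         Ac.No.      Type         Amount\n"
    "\n"
    "1654339087      28.01.2019  1508765556  Invoice      1,268.40\n"
    "1655214567      18.12.2018  3508753550  Invoice      3,134.20\n"
    "1609833445      12.02.2019  2500444556  Invoice      2,735.84\n"
    "Total                                                7,138.44\n"
)

print(extract_invoice_numbers(sample))
```

Because the second pattern is anchored at the start of each line, the account numbers in the third column are not picked up, and the number of rows never has to be known in advance.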

2 Answers


You may try reading the file line by line, starting with the second line:

import re

with open('your_file.txt') as f:
    next(f, None)                        # consume the header
    for line in f:
        match = re.search(r'^\d+', line)
        if match:                        # skip blank or non-invoice lines
            print(match.group())         # print the invoice number
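
Since the question says the PDFs have already been converted to strings, the same line-by-line idea might also be applied to text held in memory. The following is a rough sketch, not a tested solution for real invoices: the function name invoice_numbers_from_lines and the sample string are hypothetical, and it assumes one table per document, with the table ending at the first non-blank line that does not start with at least five digits.

```python
import re

def invoice_numbers_from_lines(text):
    """Walk the text line by line and collect numbers below the header."""
    numbers = []
    in_table = False
    for line in text.splitlines():
        if not in_table:
            # Keep scanning until the header line is found.
            in_table = "invoice number" in line.lower()
            continue
        if not line.strip():
            continue  # skip blank lines between the header and the rows
        match = re.match(r"\s*(\d{5,})", line)
        if match is None:
            break  # the first non-invoice line ends the table
        numbers.append(match.group(1))
    return numbers

sample = (
    "Some text before the table\n"
    "Invoice Number Date         Ac.No.      Type         Amount\n"
    "\n"
    "1654339087      28.01.2019  1508765556  Invoice      1,268.40\n"
    "1655214567      18.12.2018  3508753550  Invoice      3,134.20\n"
    "1609833445      12.02.2019  2500444556  Invoice      2,735.84\n"
    "Total                                                7,138.44\n"
)

print(invoice_numbers_from_lines(sample))
```

This handles any number of rows without changing the pattern.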
Tim Biegeleisen
  • The question is not about matching numbers at the start of a line, but how to capture part of a string in a repeated group. Please revert the dupe status. – Wiktor Stribiżew Jul 01 '19 at 08:24
  • @WiktorStribiżew I disagree with your comment. In my experience, a single PDF (e.g. generated by something like Jasper reports) would only have a _single_ header. Therefore, the OP's question regarding repeating a regex refers to applying the same pattern for an invoice _line_, not matching the entire header + lines pattern repeatedly. – Tim Biegeleisen Jul 01 '19 at 08:26
  • @Tim Biegeleisen Thanks for the help, but when I try this it only reads the header and not the lines below. – ManojK Jul 01 '19 at 09:00
  • @manojk I am surprised by this, and AFAIK the while loop should be iterating over every line in your file. Maybe you have some other issue with the script you are using. – Tim Biegeleisen Jul 01 '19 at 09:02
  • @Tim Biegeleisen - Yes sir, quite possible. For now I have a working solution using the regex package. – ManojK Jul 01 '19 at 10:46

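As suggested by @Wiktor Stribiżew in another SO post, "Capture repeated groups in python regex" (https://stackoverflow.com/questions/46603805/capture-repeated-groups-in-python-regex), the solution below worked for me using the regex module (https://pypi.org/project/regex/):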

import regex

# text is the string extracted from the PDF; (\d{5,}) captures runs of at least 5 digits
pattern = r'(?:^(?=.*Invoice Number)|\G(?!^)).*?\s*[A-Za-z0-9-._:\s]+\n(\d{5,})'
print(regex.findall(pattern, text, regex.M))
ManojK