
I am working on a project for which I need to extract invoice numbers from email bodies. The invoice numbers could be anywhere in the mail body, which I am trying to search using Python code. The problem is that the email senders do not use standard keywords; they use a variety of words to mention invoice numbers, e.g. Invoice Number, invoice#, inv no., invoice no., inv-no, etc.

This inconsistency makes it difficult for me to extract the invoice number from the mail body since there is no specific keyword.

After reading hundreds of emails I was able to identify the most common words used before invoice numbers, and I have created a list of them (around 15 keywords). But I am not able to search the string for that list of keywords and retrieve the words next to them to identify the invoice number. Also, the invoice number can be either numeric or alphanumeric, which adds more complexity.

I have made some progress, shown below, but I am not getting the desired output.

inv_list = ['invoice number','inv no','invoice#','invoice','invoices','inv number','invoice-number','inv-number','inv#','invoice no.'] # list of keywords used before invoice number

example_string = ('Hi Team, Could you please confirm the status of payment '
                  'for invoice# 12345678 and AP-8765432? '
                  'Also, please confirm the status of existing invoice no. 7652908. '
                  'Thanks')

# Basic code to test if any word from inv_list exists in example_string

for item in inv_list:
    if item in example_string:
        print(item)

# gives output like:

invoice#
invoice
invoice no.

Next, after searching for a couple of hours, I found the question "How to get a list with words that are next to a specific word in a string in Python", but I am not able to adapt its function to a list of words. I tried:

def get_next_words(mailbody, invoice_text_list, sep=' '):
    mail_body_words = mailbody.split(sep)
    for word in invoice_text_list:
        if word in mail_body_words:
            yield next(mail_body_words)

words = get_next_words(example_string,inv_list)

for w in words:
    print(w)

and getting

TypeError: 'list' object is not an iterator
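
For reference, this error arises because `next()` expects an iterator, while `str.split` returns a plain list, which is iterable but is not itself an iterator. A minimal sketch of the difference (the shortened sample text and the variable names are only for illustration):

```python
words = 'invoice# 12345678 and AP-8765432'.split(' ')

try:
    next(words)              # a list is iterable, but it is not an iterator
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

word_iter = iter(words)      # iter() wraps the list in an iterator
first = next(word_iter)      # first word of the sample text
second = next(word_iter)     # the word right after it
print(first, second)
```

Even with `iter()`, `next()` only returns the next unread word from the start of the list, not the word that follows a particular keyword, so an index-based lookup is probably a better fit for this task.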

The expected output is the words from 'example_string' that directly follow any keyword from 'inv_list' (I am assuming that I can identify the invoice number from the match returned).

For the given example the output should be:

Match1: 'invoice#'             
Expected Output: '12345678'

Match2: 'invoice no.'          
Expected Output:  '7652908'

Please let me know if further details are required, any help is appreciated!!

yash
ManojK
  • In the emails, do the invoice numbers follow a particular pattern(s)? – yash May 06 '19 at 15:10
  • Unfortunately, they don't follow a pattern, they could be numeric or alpha-numeric with different character lengths, but even if a list of potential invoice numbers is extracted, it can be helpful. – ManojK May 06 '19 at 15:13
  • I think it'd be easier if you concentrated on extracting the pattern of the invoice #'s themselves rather than the preceding text. – yash May 06 '19 at 15:17
  • @yash - Thanks, that's what I am struggling with. The mail body can have other numbers like Account no., PO No. & Customer ID which look like an invoice number. – ManojK May 06 '19 at 15:25
  • Have you considered natural language processing instead? Python libraries for that field exist (NKLM is a well known one, I believe there are others). I wouldn't expect regex to be sufficient for a problem with this level of complexity. – jpmc26 May 08 '19 at 06:40
  • @jpmc26 - You are correct, I am also looking for an NLP based solution to achieve this more accurately, however when I search Google for "NKLM Python", I am not getting anything relevant, can you share more details if possible? – ManojK May 08 '19 at 07:26
  • I don't personally know really anything about NLP. It just struck me that formal language tooling, as useful as it is to us programmers, isn't really up to the job of dealing with natural language. – jpmc26 May 08 '19 at 07:33
  • Then it should be "NLTK", one of the most used Python libraries for NLP, thanks!! – ManojK May 08 '19 at 07:38
  • Sometimes I forget to double check things. =) Sorry about that. – jpmc26 May 08 '19 at 07:57

3 Answers

1

You can use an approach similar to your current one, but iterate over the words of the mail body instead of over the keyword list. Also, to take advantage of constant-time average lookups, store your keywords in a set rather than a list. It takes more space, but membership tests will be much faster.

inv_list = {'invoice number','inv no','invoice#','invoice','invoices','inv number','invoice-number','inv-number','inv#','invoice no.'}

def get_next_words(mailbody, invoice_text_list, sep=' '):
    mail_body_words = mailbody.split(sep)
    for i in range(len(mail_body_words) - 1):  # stop early so i+1 always exists
        if mail_body_words[i] in invoice_text_list:
            yield mail_body_words[i+1]
        elif f'{mail_body_words[i]} {mail_body_words[i+1]}' in invoice_text_list and i + 2 < len(mail_body_words):
            yield mail_body_words[i+2]  # word after a two-word keyword

words = get_next_words(example_string, inv_list)

for w in words:
    print(w)
MyNameIsCaleb
  • Gotcha. I took a new approach and tested it now. Try that one out. – MyNameIsCaleb May 06 '19 at 15:41
  • As pointed out in [atsteich's answer](https://stackoverflow.com/a/56008479/6067149), a single word in `mail_body_words` will never match a keyword with spaces such as `'invoice number'`. – evergreen May 06 '19 at 16:14
  • That's true, I adjusted it. Although a better answer would be to use regex and probably faster. – MyNameIsCaleb May 06 '19 at 16:19
  • Thanks, this is close to what I want, but when we have keywords like 'invoice no. 7652908' it returns 'no.' as the identified text, since there is a space between the words 'invoice' and 'no.'. As suggested, I am trying to achieve this task using regex. I wanted to upvote this but don't have enough reputation. – ManojK May 07 '19 at 10:40
  • @MyNameIsCaleb - Can you edit the answer as there is a typo on line 6 in the function above, an extra character 'f' in the line is returning an error while running the code. – ManojK May 07 '19 at 10:42
  • What version of Python are you using? That is an `f string` which was added in Python 3.6 – MyNameIsCaleb May 07 '19 at 13:40
  • Also feel free to make your votes, they will count once your reputation goes up. – MyNameIsCaleb May 07 '19 at 13:40
  • ok, I am coding in Spyder that's why it was returning in error, also I have already upvoted this, Thanks!! – ManojK May 08 '19 at 06:07
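
The regex route mentioned in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from any of the answers: the helper name `find_invoice_numbers`, the separator characters allowed between keyword and number, and the rule that an invoice number is a run of letters, digits, and hyphens containing at least one digit are all hypothetical choices. Keywords are sorted longest first so that 'invoice no.' is tried before the shorter 'invoice'.

```python
import re

inv_list = ['invoice number', 'inv no', 'invoice#', 'invoice', 'invoices',
            'inv number', 'invoice-number', 'inv-number', 'inv#', 'invoice no.']

example_string = ('Hi Team, Could you please confirm the status of payment '
                  'for invoice# 12345678 and AP-8765432? '
                  'Also, please confirm the status of existing invoice no. 7652908. '
                  'Thanks')

# Longest keywords first, so 'invoice no.' is tried before plain 'invoice'.
keywords = sorted(inv_list, key=len, reverse=True)
alternation = '|'.join(re.escape(k) for k in keywords)

# Keyword, optional separators, then a token of letters/digits/hyphens
# that must contain at least one digit (assumed shape of an invoice number).
pattern = re.compile(
    r'(?<!\w)(' + alternation + r')[\s:.\-]*((?=[A-Za-z\-]*\d)[A-Za-z0-9\-]+)',
    re.IGNORECASE,
)

def find_invoice_numbers(text):
    """Return (keyword, candidate invoice number) pairs found in text."""
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

for keyword, number in find_invoice_numbers(example_string):
    print(keyword, '->', number)
```

Note that this sketch only captures the token directly after a keyword, so a second number joined by "and" (such as AP-8765432 in the example) would still need separate handling.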
1

Maybe not the most efficient code, but it works. The two cases are needed to distinguish, for example, `inv no 06363636` from `inv 06363636`, because of the whitespace between `inv` and `no`.

arr = example_string.split(' ')
for ix in range(len(arr)):
    try: 
        if arr[ix]+" "+arr[ix+1] in inv_list:
            print(arr[ix+2].strip('.'))
        elif arr[ix] in inv_list:
            print(arr[ix+1].strip('.'))
    except IndexError:
        pass
atsteich
  • Thank you for the answer, since it is the closest to the desired output, I am selecting this answer, I also did some edits to make it more usable for my case. – ManojK May 07 '19 at 10:49
0

I made some edits to the answer given by atsteich to make it more useful in my scenario. Basically, I want to capture only numeric values as the invoice number and remove any extra punctuation that may come along with it.

Below is the code:

arr = example_string.split(' ')
remove_symbols = str.maketrans("","",".,-")

for ix in range(len(arr)):
    try: 
        if arr[ix]+" "+arr[ix+1] in inv_list and arr[ix+2].translate(remove_symbols).isdigit():
            print('Invoice number found:'+arr[ix+2].translate(remove_symbols))
        elif arr[ix] in inv_list and arr[ix+1].translate(remove_symbols).isdigit():
            print('Invoice number found:'+arr[ix+1].translate(remove_symbols))
    except IndexError:
        pass

Thanks everyone for the support!

ManojK