
I need to extract the phone bill due date from an SMS using Python 3.4. I have tried dateutil.parser and datefinder, but neither works for my use case.

Example: sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc@xyz.com. Pls check Inbox"

Code 1:

import datefinder
due_dates = datefinder.find_dates(sms_text)
for match in due_dates:
    print(match)

Result: 2017-07-17 00:00:00

Code 2:

import dateutil.parser as dparser
due_date = dparser.parse(sms_text,fuzzy=True)
print(due_date)

Result: ValueError probably because of multiple dates in the text

How can I pick the due date from such texts? The date format is not fixed, but there will be two dates in the text: the first is the month for which the bill was generated and the second is the due date, in that order. A regular expression that parses the text would be fine too.

More sample texts:

  1. Hello! Your phone billed outstanding is 293.72 due date is 03rd Jul.
  2. Bill dated 06-JUN-17 for Rs 219 is due today for your phone No. 1234567890
  3. Bill dated 06-JUN-17 for Rs 219 is due on Jul 5 for your phone No. 1234567890
  4. Bill dated 27-Jun-17 for your operator fixedline/broadband ID 1234567890 has been sent at abc@xyz.com from xyz@abc.com. Due amount: Rs 3,764.53, due date: 16-Jul-17.
  5. Details of bill dated 21-JUN-2017 for phone no. 1234567890: Total Due: Rs 374.12, Due Date: 09-JUL-2017, Bill Delivery Date: 25-Jun-2017,
  6. Greetings! Bill for your mobile 1234567890, dtd 18-Jun-17, payment due date 06-Jul-17 has been sent on abc@xyz.com
  7. Dear customer, your phone bill of Rs.191.24 was due on 25-Jun-2017
  8. Hi! Your phone bill for Rs. 560.41 is due on 03-07-2017.
Drunk Knight
  • If your strings are as simple as this, you can just use regex. – cs95 Jul 13 '17 at 12:14
  • @cᴏʟᴅsᴘᴇᴇᴅ I'd love to sir... the strings are simple but date format may vary. Also, I am not very good with regex. If the result is extracting the due_date, a regex would also be perfect for me. – Drunk Knight Jul 13 '17 at 12:17
  • When you say the date format may vary, that rings a few alarm bells. What possible date formats would you encounter? There's no point having a regex that works for one format but fails for everything else. – cs95 Jul 13 '17 at 12:18
  • Due dates can be: YYYY-MM-DD, DD-MM-YYYY, MMMDD, DDMMM. Bill month can be: MMM-YY, MMM'YY, MMM YYYY. These are a few examples I have encountered. As the format was not fixed, I was trying to solve it using Python 3.x utilities that can detect different date formats – Drunk Knight Jul 13 '17 at 12:21
  • My apologies. I'm not sure regex can handle so many formats. – cs95 Jul 13 '17 at 12:28
  • Would the answer to this question fit your data sample? https://stackoverflow.com/questions/7028689/how-to-parse-multiple-dates-from-a-block-of-text-in-python-or-another-language – BoboDarph Jul 13 '17 at 12:53
  • What is the `Rs.72.23` part? Is it always located between the two dates? Because this is the part that messes up `datefinder`. – Gall Jul 13 '17 at 13:03
  • It is the bill amount. **Rs.** is the currency notation in India. It may or may not be located between the two dates – Drunk Knight Jul 13 '17 at 13:04

4 Answers

3

An idea for using dateutil.parser:

from dateutil.parser import parse

for s in sms_text.split():
    try:
        print(parse(s))
    except ValueError:
        pass
ISV
2

Two things prevent datefinder from parsing your samples correctly:

  1. the bill amount: numbers are interpreted as years, so a 3- or 4-digit number produces a date
  2. characters that datefinder treats as delimiters (here ':') can prevent it from finding a suitable date format

The idea is to first sanitize the text by removing the parts that stop datefinder from identifying all the dates. Unfortunately, this involves some trial and error, as the regex used by the package is too big for me to analyze thoroughly.

import re

import datefinder

def extract_duedate(text):
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non-delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)

    # The due date is taken to be the last date found in the text
    return list(datefinder.find_dates(text))[-1]

Rs[\d,\. ]+ will remove the bill amount so it is not mistaken as part of a date. It will match strings of the form 'Rs[.][ ][12,]345[.67]' (actually more variations but this is just to illustrate).
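To see what the sanitizing step does on its own, here is a minimal sketch that applies the same pattern with only the standard re module. The helper name sanitize and the constant SANITIZE are made up for illustration; the sample string is a fragment of sample 4 from the question.

```python
import re

# Same pattern as in extract_duedate: a colon, or "Rs" followed by
# digits, commas, dots and spaces (the bill amount).
SANITIZE = re.compile(r':|Rs[\d,\. ]+', flags=re.IGNORECASE)

def sanitize(text):
    """Replace colons and the bill amount with '|' (hypothetical helper)."""
    return SANITIZE.sub('|', text)

sample = "Due amount: Rs 3,764.53, due date: 16-Jul-17."
print(sanitize(sample))  # Due amount| |due date| 16-Jul-17.
```

Note that the amount match also swallows the trailing comma and space, which is why the remaining text starts directly at "due date".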

Obviously, this is a raw example function. Here are the results I get:

1 : 2017-07-03 00:00:00
2 : 2017-06-06 00:00:00 # Wrong result: first date instead of today
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00

There is one problem with sample 2: datefinder does not recognize 'today' on its own.

Example:

>>> list(datefinder.find_dates('Rs 219 is due today'))
[datetime.datetime(219, 7, 13, 0, 0)]
>>> list(datefinder.find_dates('is due today'))
[]

So, to handle this case, we can simply replace the token 'today' with the current date as a first step. This gives the following function:

import datetime
import re

import datefinder

def extract_duedate(text):
    if 'today' in text:
        text = text.replace('today', datetime.date.today().isoformat())

    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non-delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)

    return list(datefinder.find_dates(text))[-1]

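Now sample 2 is handled as well. Two results still deserve a check: sample 5 returns the last date in the text (the bill delivery date, 25-Jun-2017) rather than the due date 09-JUL-2017, and sample 8 reads 03-07-2017 month-first, although it is probably meant as 3 July: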

1 : 2017-07-03 00:00:00
2 : 2017-07-18 00:00:00 # Well, this is the date of my test
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00

If you need, you can let the function return all dates and they should all be correct.
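If the function returns every date, one possible way to pick the due date is to take the latest one. This is only a heuristic: it assumes the due date is always the most recent date in the message, which would break if, say, a delivery date fell after the due date. The helper name latest_date is hypothetical, and the list below stands in for the parser output for sample 5 rather than being produced by datefinder here.

```python
from datetime import datetime

def latest_date(dates):
    """Return the most recent datetime in `dates`, or None if there is none."""
    dates = list(dates)  # accept generators as well as lists
    return max(dates) if dates else None

# Stand-in for the dates in sample 5: bill date, due date, delivery date.
found = [datetime(2017, 6, 21), datetime(2017, 7, 9), datetime(2017, 6, 25)]
print(latest_date(found))  # 2017-07-09 00:00:00
```

With this selection, sample 5 yields 09-JUL-2017 instead of the bill delivery date that the "last date in the text" rule picks.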

Gall
  • Agreed. I am working along similar lines, trying to split the text and check individual words for a date. In your solution, I think regex might need some modification as **datetime.datetime(2017, 7, 17, 0, 0)** is no date in the text. It is still somehow referring to something else – Drunk Knight Jul 13 '17 at 13:27
  • @DrunkKnight Actually, I think the year is filled as of today by dateutil. Anyway, I'm testing with your added examples, it almost works for all the cases. – Gall Jul 13 '17 at 13:31
  • Does the regex work or you are using Python utility? – Drunk Knight Jul 13 '17 at 13:55
  • @DrunkKnight I'm using the same method as in my answer trying to find a better regex. The idea would be to find the parts that cause issues so you can sanitize the string first. – Gall Jul 13 '17 at 14:04
  • @DrunkKnight I updated my answer, this is not perfect but I hope this will help. – Gall Jul 13 '17 at 14:44
  • @ Gall Laptop crashed; so got delayed in responding. Thanks for the solution that you have posted. It solves most problems, although I am struggling to understand the regex. Can we do something like iterating all dates assuming that largest date is going to be the due date? – Drunk Knight Jul 17 '17 at 16:20
  • @DrunkKnight Good point, I'll add some info about the regex. Basically, it only removes some words (the extra tokens: from, to, due, etc...) and the bill amount by trying to accommodate all forms found in the examples. You can process the found dates this way: if you change the return line of the function, you should get all the dates found. I just don't see an easier/better way as you have a lot of text and date formats. – Gall Jul 18 '17 at 07:26
  • @DrunkKnight Also, I have an idea to handle the 'today' case. Does this case need to return the current date when parsing the text? – Gall Jul 18 '17 at 07:27
  • I did something like this: – Drunk Knight Jul 18 '17 at 13:19
  • Before calling this function, I checked for "today" in text, then set due_date as current_date – Drunk Knight Jul 18 '17 at 13:22
  • @DrunkKnight I'm not sure splitting the text is a good idea, you will split some dates. I updated my answer with something I think a little bit better. Also, if you remove the `[-1]` part of the function, you will get the complete list of datetime objects. – Gall Jul 18 '17 at 13:25
0

Why not just use a regex? If your input strings always contain the substrings 'due on' ... 'has been', you can do something like this:

import re
from datetime import datetime

string = """Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been
 sent to your regd email ID abc@xyz.com. Pls check Inbox"""

match_obj = re.search(r'due on (.*) has been', string)

if match_obj:
    date_str = match_obj.group(1)
    # Try each known format in turn: DD-MM-YYYY, YYYY-MM-DD, MM-DD.
    # Extend the tuple with whatever other formats you encounter.
    for fmt in ("%d-%m-%Y", "%Y-%m-%d", "%m-%d"):
        try:
            print(datetime.strptime(date_str, fmt))
            break
        except ValueError:
            continue
else:
    print("No match!!")
Alexey
0

Given a text message like the example you provided:

sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc@xyz.com. Pls check Inbox"

You could use Python's built-in re module to split on the 'due on' and 'has been' parts of the string.

import re

sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc@xyz.com. Pls check Inbox"

# Keep what lies between 'due on' and 'has been', minus surrounding spaces
due_date = re.split('due on', re.split('has been', sms_text)[0])[1].strip()

print(due_date)

Result: 15-07-2017

With this example the date format does not matter, but it is important that the words you are splitting the string on are consistent.
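Where the surrounding words are less consistent, one possible variation is to anchor only on the word 'due' and take the first date-like token after it. The sketch below is an assumption-laden illustration, not a general solution: the names DATE_TOKEN, FORMATS and due_date_after_keyword are invented here, it only covers DD-MM-YYYY and DD-Mon-YY(YY) tokens, it relies on English month abbreviations, and it does not handle forms such as '03rd Jul', 'Jul 5' or 'today'.

```python
import re
from datetime import datetime

# Hypothetical token pattern: DD-MM-YYYY, DD-Mon-YY or DD-Mon-YYYY.
DATE_TOKEN = re.compile(r'\b\d{1,2}-(?:\d{1,2}|[A-Za-z]{3})-\d{2,4}\b')
FORMATS = ("%d-%m-%Y", "%d-%b-%Y", "%d-%b-%y")

def due_date_after_keyword(text):
    """Return the first supported date after the word 'due', or None."""
    pos = text.lower().find('due')
    if pos == -1:
        return None
    match = DATE_TOKEN.search(text, pos)  # only look after 'due'
    if match is None:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(match.group(0), fmt)
        except ValueError:
            continue
    return None

sms_text = ("Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 "
            "has been sent to your regd email ID abc@xyz.com. Pls check Inbox")
print(due_date_after_keyword(sms_text))  # 2017-07-15 00:00:00
```

Starting the search at the position of 'due' is what skips the earlier bill date in messages such as sample 5.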

  • It is similar to what Alexey has posted, but the text is not consistent wherein lies the entire problem. – Drunk Knight Jul 13 '17 at 13:00
  • Could you maybe add the restrictions to the question? Since it is valuable information for people trying to help you. – Daan ter horst Jul 13 '17 at 13:02
  • I am sincerely hoping that someone can provide a solution without the restrictions, as it is essentially date extraction, which Python already provides tools for. Adding restrictions will neither help me nor give us a fool-proof solution – Drunk Knight Jul 13 '17 at 13:07