
I went through several existing questions before posting this one, so please read it through. The two answers below solved 90% of my problem:

parse multiple dates using dateutil

How to parse multiple dates from a block of text in Python (or another language)

Problem: I need to parse multiple dates in multiple formats in Python

Solution from the links above: this works for most inputs, but there are still certain formats that I cannot parse.

Formats which still can't be parsed are:

  1. text ='I want to visit from May 16-May 18'

  2. text ='I want to visit from May 16-18'

  3. text ='I want to visit from May 6 May 18'

I also tried regular expressions, but since dates can arrive in any format the code became very complex, so I ruled that option out. Please suggest modifications to the code in the linked answers so that the three formats above are handled as well.

Martin Evans
Rahul Agarwal
  • What about "May 16-18", "May 16 to 18", "May 16/17/18" or some other variants? The thing is, at some point, especially if you are dealing with free-form user input, you'll find that parsing all possible variants people can come up with is not going to be feasible. You might be better off rejecting things you can't *reasonably* parse. – bgse Sep 14 '17 at 14:15
  • "May 16-18" and "May 16/17/18" also don't work, but "May 16 to 18" works with the current code. In the end I may have to reject such input, but I was wondering whether there is a solution that also handles these remaining 3-4 formats. I guess that if "/" and "-" could be handled as well, it would work. – Rahul Agarwal Sep 14 '17 at 14:26

1 Answer


This kind of problem will always need tweaking as new edge cases turn up, but the following approach (written for Python 2.7) is fairly robust:

from itertools import groupby, izip_longest
from datetime import datetime, timedelta
import calendar
import string
import re


def get_date_part(x):
    if x.lower() in month_list:
        return x

    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)

    if day:
        return day.group(1)

    return False


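def month_full(month):
    # Normalise a full or abbreviated month name to its abbreviation, e.g. 'August' -> 'Aug'.
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')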

tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
    ]

cur_year = 2017    

month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
remove_punc = string.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]

    days = []
    months = []
    years = []

    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)

        if k:
            months = map(month_full, values)
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)

        if days and months:
            if years:
                dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in izip_longest(months, days, years, fillvalue=years[0])]            
            else:
                dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in izip_longest(months, days, fillvalue=months[0])]
                years = [cur_year]

            # Fix for jumps in year
            dates = []
            start_date = datetime(years[0], 1, 1)
            next_year = years[0] + 1

            for d in dates_raw:
                if d < start_date:
                    d = d.replace(year=next_year)
                    next_year += 1
                start_date = d
                dates.append(d)

            print "{}  ->  {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates))

This converts the test strings as follows:

I want to visit from May 16-May 18  ->  16/05/2017, 18/05/2017
I want to visit from May 16-18  ->  16/05/2017, 18/05/2017
I want to visit from May 6 May 18  ->  06/05/2017, 18/05/2017
May 6,7,8,9,10  ->  06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June  ->  08/05/2017, 10/06/2017
July 10/20/30  ->  10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please  ->  01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January  ->  02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan  ->  15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017  ->  01/11/2017
27th Oct 2010 until 1st jan  ->  27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012  ->  27/10/2010, 01/01/2012
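
The year jumps in the last few lines (for example `15 march, 10 feb, 5 jan`) come from the "Fix for jumps in year" loop in the code. Below is a minimal standalone sketch of that rule. The function name `fix_year_jumps` is hypothetical, and the snippet uses Python 3 syntax, unlike the Python 2.7 code above:

```python
from datetime import datetime


def fix_year_jumps(dates_raw, first_year):
    """Move any date that precedes its predecessor into a later year."""
    dates = []
    previous = datetime(first_year, 1, 1)
    next_year = first_year + 1

    for d in dates_raw:
        if d < previous:
            # Out of order, so assume the date belongs to the next unused year.
            d = d.replace(year=next_year)
            next_year += 1
        previous = d
        dates.append(d)

    return dates


# All three dates start out in 2017, as they would after strptime() + replace().
raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
fixed = fix_year_jumps(raw, 2017)
print(', '.join(d.strftime('%d/%m/%Y') for d in fixed))
# -> 15/03/2017, 10/02/2018, 05/01/2019
```

Note that the rule assumes dates are always listed in chronological order, so a list written in descending order is read as spanning several years.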

This works as follows:

  1. First create a list of valid month names, i.e. both full and abbreviated.

  2. Make a translation table to make it easy to quickly remove any punctuation from the text.

  3. Split the text, and extract only the date parts by using a function with a regular expression to spot days or months.

  4. Sort the list based on whether or not each part is a digit. This groups months at the front and numbers at the end.

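  5. Pair each day with a month, reusing the first month when there are more days than months. Convert months into abbreviated form, e.g. August to Aug, and convert each pair into a datetime object, treating any number between 1900 and 2100 as a year and falling back to `cur_year` otherwise.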

  6. If a date appears to be before the previous one, add a whole year.
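
Steps 1 to 4 can be sketched on their own as follows. This is only an illustration, written in Python 3 syntax, where the translation table is built with `str.maketrans` rather than `string.maketrans`. The helper `extract_date_parts` and the constant names are hypothetical, and it assumes an English locale for `calendar.month_name`:

```python
import calendar
import re
import string

# Step 1: full and abbreviated month names, lower-cased (empty entries skipped).
MONTHS = {m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if m}

# Step 2: translation table mapping every punctuation character to a space.
REMOVE_PUNC = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def get_date_part(token):
    """Return a month name, or the digits of a day/year, else None."""
    if token.lower() in MONTHS:
        return token
    match = re.match(r'(\d+)(\b|st|nd|rd|th)', token, re.I)
    return match.group(1) if match else None


def extract_date_parts(text):
    # Step 3: blank out punctuation, split on whitespace, keep date parts only.
    parts = [get_date_part(t) for t in text.translate(REMOVE_PUNC).split()]
    parts = [p for p in parts if p]
    # Step 4: stable sort, so months (isdigit() is False) come before numbers.
    return sorted(parts, key=str.isdigit)


print(extract_date_parts('27th Oct 2010 until 1st jan'))
# -> ['Oct', 'jan', '27', '2010', '1']
```

Because the sort is stable, months and numbers each keep the order in which they appeared in the text, which is what lets the later steps pair them up positionally.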

Martin Evans
  • Just one problem: I am using Python 2.7 and the `maketrans` function is not there. Here is the link for it: https://stackoverflow.com/questions/19064378/how-to-resolve-str-has-no-attribute-maketrans-error-in-python. Can you please suggest another method for the same thing? – Rahul Agarwal Sep 14 '17 at 15:50
  • I have updated it to 2.7. It does have that function, it is just used differently. – Martin Evans Sep 14 '17 at 15:51
  • Martin Evans: Thanks for the updated one! With this code I can handle the formats you mentioned above, but not dates like "2nd march" or "1st January". Also, if there are 3 dates in the text, can the code output all 3? Please let me know whether your suggested solution can handle both scenarios. – Rahul Agarwal Sep 18 '17 at 14:19
  • It should now work with `st` `nd` `rd` `th`. It also now works with any number of dates. – Martin Evans Sep 18 '17 at 15:52
  • Martin: One more problem: the solution fails if the date is in DD/MM/YYYY format or looks like 27th Oct 2017. Whenever the year is mentioned, `dates_raw` either fails or the code never gets into that loop. – Rahul Agarwal Sep 27 '17 at 07:26
  • You could add a special case for `DD/MM/YYYY` using a regular expression to spot them and pass them to `strptime()`. I have improved it to deal with years though. – Martin Evans Sep 27 '17 at 08:51
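
The `DD/MM/YYYY` special case suggested in the last comment might look roughly like the sketch below. It is not part of the answer's code. The pattern `NUMERIC_DATE` and the function `extract_numeric_dates` are hypothetical names, the snippet uses Python 3 syntax, and it assumes day-first dates with a four-digit year:

```python
import re
from datetime import datetime

# Hypothetical pattern: 1-2 digits, '/', 1-2 digits, '/', 4 digits.
NUMERIC_DATE = re.compile(r'\b(\d{1,2})/(\d{1,2})/(\d{4})\b')


def extract_numeric_dates(text):
    """Return (dates, remaining_text) for every valid DD/MM/YYYY in text."""
    dates = []

    def replace(match):
        try:
            dates.append(datetime.strptime(match.group(0), '%d/%m/%Y'))
        except ValueError:
            # Not a real calendar date (e.g. 31/02/2017): leave it in place.
            return match.group(0)
        # Blank out the matched date so later parsing does not see it again.
        return ' '

    remaining = NUMERIC_DATE.sub(replace, text)
    return dates, remaining


dates, rest = extract_numeric_dates('Arriving 27/10/2017, leaving 02/11/2017 or May 5')
print([d.strftime('%d/%m/%Y') for d in dates])
# -> ['27/10/2017', '02/11/2017']
```

The remaining text (here still containing `May 5`) could then be passed to the month-name parser, so that the punctuation-stripping step does not split a numeric date into three unrelated numbers.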