I have to recognize different date formats in a string, as listed below, using a regular expression.

The date can be 21/12/2018
or 12/21/2018
or 2018/12/21
or 12/2018
or 21-12-2018
or 12-21-2018
or 2018-12-21
or 21-Jan-2018
or Jan 21,2018
or 21st Jan 2018
or 21-Jan-2018
or Jan 21,2018
or 21st Jan 2018
or Jan 21, 2018
or Jan 21, 2018
or 2018 Dec. 21
or 2018 Dec 21
or 21st of Jan 2018
or 21st of Jan 2018
or Jan 2018
or Jan 2018
or Jan. 2018
or Jan, 2018
or 2018
[It should recognize three patterns: (year only), (year and month), (year, month and day). The year is mandatory in every date format to be recognized.]
[Months are abbreviated to three letters, with the first letter capitalized.]

My regular expression is as below:

\b(((((0?[1-9]|[12][0-9]|3[01])(\s*(st|nd|rd|th)?\s*(of)?\s*)?)|(20[012]\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[\/\-\.\,\s]*){1,3})\b

When I test it on debuggex.com, it does not work as expected and also matches other patterns. I have to recognize three patterns: (year only), (year and month), (year, month and day); the year is mandatory in every date pattern to be recognized. What corrections are needed for it to work properly? Please help.
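For reference, one way to keep such a pattern manageable is to build it from named pieces and list one alternative per format, longest first, so that the year is required in each. The sketch below is only an illustration, not a verified fix for the pattern above: the names `YEAR`, `NUM`, `MON`, `ORD`, `ALTERNATIVES` and `DATE_RE` are made up here, it reuses the `20[012]\d` year range from the pattern above, and it does not try to tell day-first from month-first dates.

```python
import re

# Building blocks (all names here are illustrative assumptions).
YEAR = r"20[012]\d"                      # same year range as the original pattern
NUM = r"(?:0?[1-9]|[12]\d|3[01])"        # a day or month number, not disambiguated
MON = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
ORD = r"(?:st|nd|rd|th)?"                # optional ordinal suffix

# One alternative per format; every alternative contains YEAR.
# Longer forms come first so a full date is not cut down to a bare year.
ALTERNATIVES = [
    rf"{YEAR}[/-]{NUM}[/-]{NUM}",                          # 2018/12/21, 2018-12-21
    rf"{NUM}[/-]{NUM}[/-]{YEAR}",                          # 21/12/2018, 12-21-2018
    rf"{NUM}[/-]{YEAR}",                                   # 12/2018
    rf"{NUM}{ORD}(?:\s+of)?[\s-]+{MON}\.?[\s,-]+{YEAR}",   # 21-Jan-2018, 21st of Jan 2018
    rf"{MON}\.?\s+{NUM}{ORD}(?:,\s*|\s+){YEAR}",           # Jan 21,2018, Jan 21, 2018
    rf"{YEAR}\s+{MON}\.?\s+{NUM}",                         # 2018 Dec. 21, 2018 Dec 21
    rf"{MON}[.,]?\s+{YEAR}",                               # Jan 2018, Jan. 2018, Jan, 2018
    YEAR,                                                  # 2018
]
DATE_RE = re.compile(r"\b(?:" + "|".join(ALTERNATIVES) + r")\b")

SAMPLES = [
    "21/12/2018", "12/21/2018", "2018/12/21", "12/2018",
    "21-12-2018", "12-21-2018", "2018-12-21", "21-Jan-2018",
    "Jan 21,2018", "21st Jan 2018", "Jan 21, 2018", "2018 Dec. 21",
    "2018 Dec 21", "21st of Jan 2018", "Jan 2018", "Jan. 2018",
    "Jan, 2018", "2018",
]

# Report any sample the pattern does not cover in full.
for sample in SAMPLES:
    if not DATE_RE.fullmatch(sample):
        print("not matched:", sample)

# Extract date-like substrings from running text.
text = "Filed 21st of Jan 2018, covering FY 2017."
print([m.group() for m in DATE_RE.finditer(text)])
```

Because the match objects carry positions (`m.span()`), the same pattern can be used with `re.sub` to replace the dates in place.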

Tomerikoo
Shijith
  • I didn't downvote, but I did mark as too broad. You need to write a really long regex alternation. – Tim Biegeleisen Aug 07 '19 at 07:43
  • Regexes are a bad choice for solving a problem where there are so many `or`s. I think you'd have a better time writing a parser. – Arne Aug 07 '19 at 07:49
  • `21-12-2018 or 12-21-2018` – what are you going to do on 11th of December? – eumiro Aug 07 '19 at 07:56
  • @eumiro, these dates are from the column headings of 10-K docs for different companies, which I am trying to scrape, so I have no control over the input date format. – Shijith Aug 07 '19 at 08:02

1 Answer

IIUC, `dateutil.parser` would be a better choice than `re`:

import dateutil.parser as dparser

# The sample strings from the question.
samples = ["21/12/2018","12/21/2018","2018/12/21","12/2018",
"21-12-2018","12-21-2018","2018-12-21","21-Jan-2018",
"Jan 21,2018","21st Jan 2018","21-Jan-2018","Jan 21,2018",
"21st Jan 2018","Jan 21, 2018","Jan 21, 2018","2018 Dec. 21",
"2018 Dec 21","21st of Jan 2018","21st of Jan 2018","Jan 2018",
"Jan 2018","Jan. 2018","Jan, 2018","2018"]

# fuzzy=True lets the parser skip tokens it does not recognize.
[str(dparser.parse(s, fuzzy=True)) for s in samples]

Output:

['2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-07 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-01-21 00:00:00',
 '2019-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2019-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-12-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-21 00:00:00',
 '2018-01-07 00:00:00',
 '2018-01-07 00:00:00',
 '2018-01-07 00:00:00',
 '2018-01-07 00:00:00',
 '2018-08-07 00:00:00']
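
Note that the `07` day and the `08` month in some rows are not in the input. They are consistent with the day this was run (7 Aug 2019, going by the comment timestamps): `parse` fills any field it does not find from its `default` argument, which falls back to today's date at midnight. The two `2019` rows belong to "Jan 21,2018" (no space after the comma), where the year was evidently not picked up. A minimal sketch of pinning the default follows; `DEFAULT` and `parse_pinned` are hypothetical names, and falling back to day 1 and month 1 is an assumption about what is wanted.

```python
import re
from datetime import datetime

import dateutil.parser as dparser

# Assumption: missing month/day should fall back to 1 rather than to today's values.
DEFAULT = datetime(2000, 1, 1)


def parse_pinned(text):
    """Parse with a fixed default so the result does not depend on the run date."""
    # Put a space after any comma that is directly followed by a non-space
    # character, turning "Jan 21,2018" into the "Jan 21, 2018" form.
    spaced = re.sub(r",(?=\S)", ", ", text)
    return dparser.parse(spaced, fuzzy=True, default=DEFAULT)


for s in ["12/2018", "Jan 2018", "2018", "Jan 21,2018"]:
    print(s, "->", parse_pinned(s))
```

The comma step relies only on the fact that "Jan 21, 2018" already parses to 2018 in the output above; other inputs may need different normalization.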

With `fuzzy=True`, `dateutil.parser` can also pick out a date embedded in a sentence (although this does not always work):

s = 'The new millennium has finally come and it is now 1st of Jan 2000.'
str(dparser.parse(s, fuzzy=True))
# '2000-01-01 00:00:00'
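
As noted in the comments, `fuzzy_with_tokens=True` additionally returns the parts of the string that the parser skipped, which can help when the surrounding text matters. A small sketch using the example string from that comment; the variable names are illustrative only.

```python
from datetime import datetime

import dateutil.parser as dparser

text = "date can contain 21st of January 2018 as a part of string "

# Returns a 2-tuple: the parsed datetime and a tuple of the skipped text fragments.
parsed, skipped = dparser.parse(text, fuzzy_with_tokens=True)

print(parsed)
print(skipped)
```

The skipped fragments do not include character positions, so locating the date within the original string for replacement would still need a separate step.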
Chris
  • For those who wonder about **IIUC**: _If I Understand Correctly_. – Ulysse BN Aug 07 '19 at 07:56
  • Thank you, but each of these dates will be part of a string, which I have to find and extract/replace. – Shijith Aug 07 '19 at 07:57
  • @Shijith `dateutil.parser` can also take care of such cases. Let me show an example. – Chris Aug 07 '19 at 08:02
  • Thank you very much, this is working. I didn't know this could be used to parse a string. – Shijith Aug 07 '19 at 08:08
  • `dateutil.parser.parse(string_with_date, fuzzy_with_tokens=True)` returns a tuple: the first element is a `datetime.datetime` object, the second a tuple containing the rest of the string (the fuzzy tokens). E.g. for `string_with_date = 'date can contain 21st of January 2018 as a part of string '` the output will be `(datetime.datetime(2018, 1, 21, 0, 0), ('date can contain ', ' of ', ' ', 'as a part of string '))` – Shijith Aug 07 '19 at 09:08
  • @Chris, can you please update the answer with `fuzzy_with_tokens=True`? – Shijith Aug 07 '19 at 09:09