I'm trying to retrieve a date from a string. The problem is that the pattern of this date varies a lot (the string comes from an OCR reading). These are the patterns I need to identify:

  • 11/11/1111 (I can get this one already)
  • 11-11-1111 (I can get this one already)
  • 11 11 1111 (I can get this one already)
  • 11- 11- 1111
  • 11 11 1111
  • 11-11 1111
  • 23- 10-17
  • 9 06- 17

So far, the regex I have is a slight adaptation of one from a Stack Overflow answer (it now allows spaces, not just - or /, as separators between the numbers):

match_date=re.search(r'(?:(?:31(\/|-|\.| )(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.| )(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.| )(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})',line)

Is there a way of building a regex for such a "fluid" date structure?

mbc

3 Answers

Regex: \b(?:\d{1,2}[- /]\s?){2}(?:\d{4}|\d{2})\b or ^(?:\d{1,2}[- /]\s?){2}(?:\d{4}|\d{2})$

Regex demo
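As a rough check of the word-boundary variant against the patterns listed in the question, here is a minimal Python sketch. The `find_date` helper and the surrounding sample text are assumptions made for illustration and are not part of the answer.

```python
import re

# Word-boundary variant of the pattern from the answer above.
DATE_RE = re.compile(r'\b(?:\d{1,2}[- /]\s?){2}(?:\d{4}|\d{2})\b')

def find_date(text):
    """Return the first date-like substring in text, or None (hypothetical helper)."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None

# Sample values taken from the question's list.
samples = [
    "11/11/1111",
    "11-11-1111",
    "11 11 1111",
    "11- 11- 1111",
    "11-11 1111",
    "23- 10-17",
    "9 06- 17",
]

for sample in samples:
    # Embed each sample in other text to mimic a line of OCR output.
    print(repr(sample), "->", find_date("scanned on " + sample + " by OCR"))
```

Note that `[- /]\s?` accepts at most two separator characters in a row, so a longer run of spaces between the numbers would not match with this pattern.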

Srdjan M.

You could go for

\b\d{1,2}[- /]+\d{1,2}[- /]+\d{2,4}\b

See a demo on regex101.com.
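If the goal is an actual date value rather than just the matched text, one option is to add capture groups to this pattern and validate the parts. The sketch below makes several assumptions that are not in the answer: the capture groups, the `extract_date` helper, day-month-year ordering, and mapping two-digit years into the 2000s.

```python
import re
from datetime import date

# The answer's pattern with capture groups added (the groups are an addition for this sketch).
DATE_RE = re.compile(r'\b(\d{1,2})[- /]+(\d{1,2})[- /]+(\d{2,4})\b')

def extract_date(text, century=2000):
    """Return the first valid day-month-year date found in text, or None."""
    for match in DATE_RE.finditer(text):
        day, month, year = match.groups()
        if len(year) == 3:
            continue  # \d{2,4} also admits three digits; treated here as OCR noise
        # Assumption: two-digit years belong to the given century.
        year_number = int(year) + (century if len(year) == 2 else 0)
        try:
            return date(year_number, int(month), int(day))
        except ValueError:
            continue  # e.g. month 13 or day 32; move on to the next match
    return None

print(extract_date("Invoice dated 23- 10-17"))  # 2017-10-23
print(extract_date("9 06- 17"))                 # 2017-06-09
```

Validating through `date(...)` rejects impossible combinations that the short pattern alone would accept, which is roughly what the long regex in the question tries to do inside the pattern itself.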

Jan

I know regex is a better answer, because a single line can match all the possibilities, but I prefer converting to datetime:

from datetime import datetime
string = "11- 11- 1111"

# Try each candidate format in turn
for fmt in ('%Y-%m-%d', '%d- %m- %Y', '%d %m %Y', '%d- %m- %y'):
    try:
        datetime_object = datetime.strptime(string, fmt)
...
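The snippet above is cut off, so the following is only a guess at how the format-by-format approach could be completed, not the answer's original code. The `parse_date` helper and the `FORMATS` tuple are assumptions, chosen so that the patterns from the question are covered.

```python
from datetime import datetime

# Candidate formats for the patterns in the question; illustrative, not exhaustive.
FORMATS = (
    '%d/%m/%Y',
    '%d-%m-%Y',
    '%d %m %Y',
    '%d- %m- %Y',
    '%d-%m %Y',
    '%d- %m-%y',
    '%d %m- %y',
)

def parse_date(string):
    """Return a datetime for the first format that fits string, or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(string, fmt)
        except ValueError:
            continue  # this format does not fit; try the next one
    return None

print(parse_date("11- 11- 1111"))
print(parse_date("23- 10-17"))
```

`strptime` has to consume the whole input, so this only works once the date has been isolated from the rest of the OCR line, and every new separator variant needs its own entry in the list.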
Joao Vitorino