-1

I'm new to Python but have to write a regex to pick up dates in dd-mm-yyyy format from text. I wrote something like this:

format1 = re.findall('[0-2][0-9]-02-(\d){4}|(([0-2][0-9]|30)-(04|06|09|11)-(\d){4})|(([0-2][0-9]|30|31)-(01|03|05|07|08|10|12)-(\d){4})',article)

It also checks whether the date is valid. I tested it at pythex.org. It returns the right dates, but unfortunately also some empty matches and random numbers:

Match 1
1.  None
2.  None
3.  None
4.  None
5.  None
6.  21-10-2005
7.  21
8.  10
9.  5

Match 2
1.  None
2.  None
3.  None
4.  None
5.  None
6.  31-12-1993
7.  31
8.  12
9.  3

How can I improve the regex to return only dates or drop everything that isn't a date?

Wiktor Stribiżew
Rabbit
  • I'm a little confused. What exactly is the return you're looking for? For example, if article = '10-10-1010' and you pass it to Python, you'll get >>> [('', '', '', '', '', '10-10-1010', '10', '10', '0')]. Are you just looking for it to return '10-10-1010'? True? False? – Dval Oct 28 '15 at 00:54
  • Just '10-10-1010' would be nice. – Rabbit Oct 28 '15 at 01:18

3 Answers

4

It looks to me like you need to make use of non-capturing groups.

Here's the thing: in a regular expression, anything inside plain parentheses () is a capturing group - the text it matches comes out as one of the items captured in a match.

If you want to use parentheses to group a part of the pattern (e.g. so that you can use | at something lower than the top level), but you don't want the text inside that parenthetical group to be a separate item in the match output, then you want to use a non-capturing group instead.

To do that, where you would have had (foo), instead use (?:foo) - adding the ?: to the beginning. That prevents that group from capturing text in the final match.
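As a minimal sketch of that change applied to the pattern from the question: every group becomes (?:...), and the parentheses around \d are dropped, since (\d){4} only ever captures the last digit of the year (which is where the stray 5 and 3 in the question's output come from). The sample text in article is made up for illustration.

```python
import re

# Hypothetical sample text; the question's real article is not shown.
article = "Born 21-10-2005, moved 31-12-1993, invalid 31-04-2000."

# Same alternatives as the question's pattern, but every group is (?:...)
# so re.findall returns whole matches instead of tuples of groups.
pattern = (
    r'[0-2][0-9]-02-\d{4}'
    r'|(?:[0-2][0-9]|30)-(?:04|06|09|11)-\d{4}'
    r'|(?:[0-2][0-9]|30|31)-(?:01|03|05|07|08|10|12)-\d{4}'
)

dates = re.findall(pattern, article)
print(dates)  # ['21-10-2005', '31-12-1993']
```

With no capturing groups left in the pattern, re.findall returns a plain list of matched strings.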

Amber
1

Amber's suggestion is perfectly fine. But may I make a suggestion? Try not to shove all the logic into the regular expression itself. It makes it nigh unreadable, and still doesn't handle the corner cases as written (for example, it accepts February 29th in every year, not just leap years). Don't use regular expressions to do the work of a true parser.

Instead, search for the general form, then parse it with dedicated date parsing code and if it passes parsing, keep it. For example:

import datetime, re

def is_valid_dmy_date(datestr):
    try:
        datetime.datetime.strptime(datestr, '%d-%m-%Y')
    except ValueError:
        return False
    return True

# In Python 3, wrap filter call in list() if you need a real list,
# or just iterate results of filter directly if that's all you need
all_dates = filter(is_valid_dmy_date, re.findall(r'\b\d\d-\d\d-\d{4}\b', article))

You'll note, the regex is dramatically simplified (I added \b zero width assertions so it won't match something like 001-01-200123, but you can remove them if matching dates should occur even without word boundaries). The work is passed to datetime.strptime, which knows what dates really are, so it correctly rejects stuff like the 29th of February, 2011.
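To see both effects at once, here is a small Python 3 sketch that repeats the helper so it runs on its own. The sample text is an assumption, chosen to include a real leap day, a non-leap 29th of February, a 31st of April, and a digit run that only the word boundaries keep out.

```python
import datetime
import re

def is_valid_dmy_date(datestr):
    """Return True if datestr is a real calendar date in dd-mm-yyyy form."""
    try:
        datetime.datetime.strptime(datestr, '%d-%m-%Y')
    except ValueError:
        return False
    return True

# Hypothetical sample text for illustration only.
article = "Valid: 29-02-2012. Invalid: 29-02-2011, 31-04-2000, 001-01-200123."

# Step 1: find everything shaped like dd-mm-yyyy.
candidates = re.findall(r'\b\d\d-\d\d-\d{4}\b', article)
# Step 2: keep only the candidates that are real dates.
valid_dates = [d for d in candidates if is_valid_dmy_date(d)]

print(candidates)   # ['29-02-2012', '29-02-2011', '31-04-2000']
print(valid_dates)  # ['29-02-2012']
```

The regex never picks up 001-01-200123 because of the \b assertions, and strptime discards the two impossible dates.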

ShadowRanger
1

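re.findall returns a list of tuples when the pattern contains capturing groups: one tuple per match, with one element per () group. You have 9 () groups in your pattern, so each tuple has 9 elements. print(format1[0][5]) may solve the problem in this case (index 5 is the group wrapping the whole 31-day-month alternative), or use re.search instead, which returns only the first match and exposes the full matched text as group(0):

format1 = re.search(r'[0-2][0-9]-02-(\d){4}|(([0-2][0-9]|30)-(04|06|09|11)-(\d){4})|(([0-2][0-9]|30|31)-(01|03|05|07|08|10|12)-(\d){4})', article)
print(format1.group(0))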
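Since re.search stops at the first match, one possible way to collect every date while leaving the question's pattern as it is would be re.finditer, taking group(0) from each match object. This is only a sketch; the sample article below is made up so that each of the three alternatives matches once.

```python
import re

# Hypothetical sample text for illustration only.
article = "Dates: 21-10-2005, 15-02-2001 and 30-04-1999."

# The pattern from the question, unchanged apart from the raw-string prefix.
pattern = (
    r'[0-2][0-9]-02-(\d){4}'
    r'|(([0-2][0-9]|30)-(04|06|09|11)-(\d){4})'
    r'|(([0-2][0-9]|30|31)-(01|03|05|07|08|10|12)-(\d){4})'
)

# group(0) is always the whole match, whichever alternative matched,
# so the nine capturing groups can simply be ignored.
all_dates = [m.group(0) for m in re.finditer(pattern, article)]
print(all_dates)  # ['21-10-2005', '15-02-2001', '30-04-1999']
```

Unlike indexing into the findall tuples, this does not depend on which alternative matched, so February dates (which have no group around the whole date) come out as well.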