I want to extract data from 500+ files that look like this:

1.   Site Identification of the GNSS Monument

     Site Name                : Aeroport du Raizet -LES ABYMES - Météo France
     Four Character ID        : ABMF
     Monument Inscription     : NONE
     IERS DOMES Number        : 97103M001
     CDP Number               : NONE
     Monument Description     : INOX TRIANGULAR PLATE ON TOP OF METALLIC PILAR
       Height of the Monument : 2.0 m
       Monument Foundation    : ROOF
       Foundation Depth       : 4.0 m
     Marker Description       : TOP AND CENTRE OF THE TRIANGULAR PLATE
     Date Installed           : 2008-07-15T00:00Z

I'm looking for the Date Installed value, which comes in two different formats: CCYY-MM-DDThh:mmZ or CCYY-MM-DD. Right now I'm using a pattern like this: date_installed = re.findall(r"Date Installed\s*:\s*(.*?)T.*$", contents, re.MULTILINE), but it only matches dates in the CCYY-MM-DDThh:mmZ format.

How can I modify my regex to extract both date formats without using the | operator?

colidyre

2 Answers

With re.findall(), every capturing group ends up in the result, so any group that should not be returned has to be non-capturing. Either the regex /Date Installed\s*:\s*(.*?(?:T.*Z)?)$/ or /Date Installed\s*:\s*(.*?)(?:T.*Z)?$/ should do the trick, depending on whether you want the time part. For the first one:

re.findall(r"Date Installed\s*:\s*(.*?(?:T.*Z)?)$", contents, re.MULTILINE)

This gives you the whole value, in the form CCYY-MM-DDThh:mmZ or CCYY-MM-DD. If you are only interested in the CCYY-MM-DD part, move the non-capturing group outside the capturing group, as in the second regex above:

re.findall(r"Date Installed\s*:\s*(.*?)(?:T.*Z)?$", contents, re.MULTILINE)
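To see both patterns side by side, here is a minimal sketch that runs them on a small string containing one date in each format. The sample text is made up for illustration (the second date is hypothetical); only the two regexes come from this answer.

```python
import re

# Hypothetical excerpt with one "Date Installed" line per format.
contents = (
    "     Marker Description       : TOP AND CENTRE OF THE TRIANGULAR PLATE\n"
    "     Date Installed           : 2008-07-15T00:00Z\n"
    "     Marker Description       : NONE\n"
    "     Date Installed           : 2011-03-02\n"
)

# The capturing group includes the optional time part.
full = re.findall(r"Date Installed\s*:\s*(.*?(?:T.*Z)?)$", contents, re.MULTILINE)

# The optional time part sits outside the capturing group and is dropped.
date_only = re.findall(r"Date Installed\s*:\s*(.*?)(?:T.*Z)?$", contents, re.MULTILINE)

print(full)       # ['2008-07-15T00:00Z', '2011-03-02']
print(date_only)  # ['2008-07-15', '2011-03-02']
```

For the 500+ files, the same call could presumably be applied to the text of each file in turn.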

See the Python docs:

(?:...)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
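The effect of the non-capturing group on the result of re.findall() can be sketched as follows. The variant with two capturing groups is not part of the answer; it is included only for comparison, and the input line is taken from the question.

```python
import re

line = "Date Installed           : 2008-07-15T00:00Z"

# Two capturing groups: findall returns one tuple per match,
# with one item per group.
with_capture = re.findall(r"Date Installed\s*:\s*(.*?)(T.*Z)?$", line)

# Time part made non-capturing: only the single remaining group is returned,
# so the result is a list of plain strings.
without_capture = re.findall(r"Date Installed\s*:\s*(.*?)(?:T.*Z)?$", line)

print(with_capture)     # [('2008-07-15', 'T00:00Z')]
print(without_capture)  # ['2008-07-15']
```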

colidyre
re.findall(r"\d{4}\-[01]\d-[0-3]\dT[0-6]\d:\d{2}Z|\d{4}\-[01]\d-[0-3]\d", contents, re.MULTILINE)

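This works in my testing and returns plain strings rather than tuples, because the pattern contains no capturing groups. Note that it matches every date in the file, not only the one after Date Installed.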
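As a rough illustration of that behaviour, the sketch below runs the pattern on a text with a second date field. The "Date Prepared" line is a hypothetical extra field added for the example. The label-anchored pattern at the end is not from this answer; it is one possible way to get only the installation date without the | operator, assuming the date is always written as CCYY-MM-DD.

```python
import re
from datetime import date

contents = (
    "     Date Prepared            : 2019-01-10\n"  # hypothetical extra field
    "     Date Installed           : 2008-07-15T00:00Z\n"
)

# No capturing groups: each whole match comes back as a plain string,
# but every date in the text is matched.
any_date = re.findall(
    r"\d{4}\-[01]\d-[0-3]\dT[0-6]\d:\d{2}Z|\d{4}\-[01]\d-[0-3]\d", contents
)

# Assumed alternative: anchor on the label and make the time part an
# optional non-capturing group.
installed = re.findall(
    r"Date Installed\s*:\s*(\d{4}-\d{2}-\d{2})(?:T\d{2}:\d{2}Z)?", contents
)

# Convert the extracted strings to date objects (Python 3.7+).
installed_dates = [date.fromisoformat(s) for s in installed]

print(any_date)         # ['2019-01-10', '2008-07-15T00:00Z']
print(installed)        # ['2008-07-15']
print(installed_dates)  # [datetime.date(2008, 7, 15)]
```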

Challe