0

I am trying to use Regex to parse a series of strings to extract one or more text dates that may be in multiple formats. The strings will look something like the following:

  1. 24 Aug 2016: nno-emvirt010a/b; 16 Aug 2016 nnt-emvirt010a/b nnd-emvirt010a/b COSI-1.6.5
  2. 24.16 nno-emvirt010a/b nnt-emvirt010a/b nnd-emvirt010a/b EI.01.02.03\
  3. 9/23/16: COSI-1.6.5 Logs updated at /vobs/COTS/1.6.5/files/Status_2016-07-27.log, Status_2016-07-28.log, Status_2016-08-05.log, Status_2016-08-08.log

I am not concerned about validating the individual date fields; just extracting the date string. The part I am unable to figure out is how to not match on number sequences that match the pattern but aren’t dates (‘1.6.5’ in ex. (1) and 01.02.03 in ex. (2)) and dates that are part of a file name (2016-07-27 in ex. (3)). In each of these exception cases in my input data, the initial numbers are preceded by either a period(.), underscore (_) or dash (-), but I cannot determine how to use this to edit the pattern syntax to not match these strings.

The pattern I have that partially works is below. It will only ignore the non date matches if it starts with 1 digit as in example 1.

/[^_\.\(\/]\d{1,4}[/\-\.\s*]([1-9]|0[1-9]|[12][0-9]|3[01]|[a-z]{3})[/\-\.\s*]\d{1,4}/ig`
Jim D
  • 1
  • 1

2 Answers2

0

I am not sure about vba check if this works . seems they have given so much options : https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch04s04.html

^(?:(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])|↵
(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9]))/(?:[0-9]{2})?[0-9]{2}$


^(?:
  # m/d or mm/dd
  (1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])
|
  # d/m or dd/mm
  (3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9])
)
# /yy or /yyyy
/(?:[0-9]{2})?[0-9]{2}$
teju c
  • 336
  • 2
  • 8
0

According to the test strings you've presented, you can use the following regex

See this regex in use here

(?<=[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])

This regex ensures that specific date formats are met and are preceded by nothing (beginning of the string) or by a non-word character (specifically a-z, A-Z, 0-9) or dot .. The date formats that will be matched are:

  • 24 Aug 2016
  • 24.16
  • 9/23/16

The regex could be further manipulated to ensure numbers are in the proper range according to days/month, etc., however, I don't feel that is really necessary.


Edits

Edit 1

Since VBA doesn't support lookbehinds, you can use the following. The date is in capture group 1.

(?:[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])

Edit 2

As per bulbus's comment below

(?:[^\w.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d{2,4})|(?:(?:\d{‌1,2}\/){2}\d{2,4})|(‌​?:\d{2,4}(?:-\d{2}){‌​2})|\d{2}\.\d{2})

Took liberty to edit that a bit.

  1. replaced [^a-zA-Z\d.] with [^\w.], comes with added advantage of excluding dates with _2016-07-28.log
  2. Due to 1 removed trailing condition (?=[^a-zA-Z\d.]).
  3. Forced year digits from \d+ to \d{2,4}

Edit 3

Due to added conditions of the regex, I've made the following edits (to improve upon both previous edits). As per the OP:

The edited pattern above works in all but 2 cases:

  1. it does not find dates with the year first (ex. 2016/07/11)
  2. if the date is contained within parenthesis in the string, it returns the left parenthesis as part of the date (ex. match = (8/20/2016)

Can you provide the edit to fix these?

In the below regexes, I've changed years to \d+ in order for it to work on any year greater than or equal to 0.

See the code in use here

(?:[^\w.]|^)((?:\d{1,2}\s+[A-Z][a-z]{2}\s+\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:\/\d{1,2}){2})|(?:\d+(?:-\d{2}){2})|\d{2}\.\d+)

This regex adds the possibility of dates in the XXXX/XX/XX format where the date may appear first.

The reason you are getting ( as a match before the regex is the nature of the Full Match. You need to, instead, grab the value of the first capture group and not the whole regex result. See this answer on how to grab submatches from a regex pattern in VBA.

Also, note that any additional date formats you need to catch need to be explicitly set in the regex. Currently, the regex supports the following date formats:

  • \d{1,2}\s+[A-Z][a-z]{2}\s+\d+
    • 12 Apr 17
    • 12 Apr 2017
  • (?:\d{1,2}\/){2}\d+
    • 1/4/17
    • 01/04/17
    • 1/4/2017
    • 01/04/2017
  • \d+(?:\/\d{1,2}){2}
    • 17/04/01
    • 2017/4/1
    • 2017/04/01
    • 17/4/1
  • \d+(?:-\d{2}){2}
    • 17-04-01
    • 2017-04-01
  • \d{2}\.\d+ - Although I'm not sure what this date format is even used for and how it could be considered efficient if it's missing month
    • 24.16
ctwheels
  • 21,901
  • 9
  • 42
  • 77
  • Thank you @jsotola, I've added an edit to use a non-capture group instead. The value sought after is in capture group 1. – ctwheels Sep 19 '17 at 19:08
  • 1
    `(?:[^\w.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d{2,4})|(?:(?:\d{1,2}\/){2}\d{2,4})|(?:\d{2,4}(?:-\d{2}){2})|\d{2}\.\d{2})` Took liberty to edit that a bit. **1** replaced `[^a-zA-Z\d.]` with `[^\w.]`, comes with added advantage of excluding dates with `_2016-07-28.log` **2** Due to `1` removed trailing condition `(?=[^a-zA-Z\d.])`. **3** Forced year digits from `\d+` to `\d{2,4}` – kaza Sep 19 '17 at 19:40
  • Thank you jsotola and @ctwheels for the solution and quick response. As you may have noticed this is my first attempt at using regex, and this has been a tremendous help. The edited pattern above works in all but 2 cases: 1) it does not find dates with the year first (ex. 2016/07/11), and 2) if the date is contained within parenthesis in the string, it returns the left parenthesis as part of the date (ex. match = (8/20/2016. Can you provide the edit to fix these? – Jim D Sep 20 '17 at 13:42
  • @JimD I've added all the edits to the end of my answer (including the new one to fix as per your comment) – ctwheels Sep 20 '17 at 15:27
  • @ctwheels, This is awesome! Thanks again. PS. the 24.16 was a typo. My apologies. – Jim D Sep 20 '17 at 17:42