5

I'm trying to write a regex expression that can be used to find dates in a string that may be preceded (or followed) by spaces, numbers, text, end-of-line, etc. The expression should handle US date formats that are either

1) Month Name Day, Year - i.e. January 10, 2019 OR
2) mm/dd/yy - i.e. 11/30/19

I found this for Month Name, Day Year

(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}

(thanks to Veverke here Regex to match date like month name day comma and year

and this for mm/dd/yy (and various combinations of m/d/y)

(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])/(?:[0-9]{2})?[0-9]{2} 

(thanks to Steven Levithan and Jan Goyvaerts here https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch04s04.html

I have tried to combine them like this

((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4})|((1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])/(?:[0-9]{2})?[0-9]{2})

and when I search the input string "Paid on 1/1/2019" for "on [regex above]" it does finds the date but not the word "on". The string is found if I just use

(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])/(?:[0-9]{2})?[0-9]{2}

Can anyone see what I'm doing wrong?

Edit

I'm using the c# .net code below:

    string stringToSearch = "Paid on 1/1/2019";
    string searchPattern = @"on ((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4})|((1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])/(?:[0-9]{2})?[0-9]{2})";
    var match = Regex.Match(stringToSearch, searchPattern, RegexOptions.IgnoreCase);


    string foundString;
    if (match.Success)
        foundString= stringToSearch.Substring(match.Index, match.Length);

For example

string searchPattern = @"on ((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4})|((1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])/(?:[0-9]{2})?[0-9]{2})";
stringToSearch = "Paid on Jan 1, 2019";
found = "on Jan 1, 2019" -- worked as expected, found the word "on" and the date

stringToSearch = "Paid on 1/1/2019";
found = "1/1/2019"  -- did not work as expected, found the date but did not include the word "on"

If I reverse the pattern

string searchPattern = @"on ((1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])/(?:[0-9]{2})?[0-9]{2})|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4})"";

stringToSearch = "Paid on Jan 1, 2019";
found = "Jan 1, 2019" -- did not work as expected, found the date but did not include the word "on"

stringToSearch = "Paid on 1/1/2019";
found = "on 1/1/2019" -- worked as expected, found the word "on" and the date

Thanks

Emma
  • 27,428
  • 11
  • 44
  • 69
TedS
  • 53
  • 1
  • 1
  • 5
  • 1
    Seems to work https://regex101.com/r/UUUJSP/1 –  May 09 '19 at 19:02
  • The regex is fine. If you are using java, link me your code. – Paul Lemarchand May 09 '19 at 19:52
  • Judging by the regex difference, you must double all the backslashes: ``\`` > ``\\`` (in string literals, ``\\`` is used to denote one backslash). What is the programming language? – Wiktor Stribiżew May 09 '19 at 20:00
  • Sorry @sln, my question was not exact. The regex does find the date but the word "on" is not part of the result. I'm using c# .net. I will edit my question to clarify. Thanks – TedS May 10 '19 at 14:05
  • 1
    Thanks for the suggestion @Emma to include inputs and outputs. This is my first post (I have read very many other posts) and appreciate suggestions to improve/clarify my question. – TedS May 10 '19 at 15:45
  • Do you want to parse natural language string dates into `DateTime` objects ? – Kunal Mukherjee May 10 '19 at 15:51
  • @Emma if you add the word "on" with a space before the regex pattern in either of your tests ("on [regex for dates]" you will see it gets the dates but not the word "on" and the dates. Kunal - I'm trying to write a method that can be used to extracts dates (with other words before or after the dates) from any input string. Thanks – TedS May 10 '19 at 16:43

1 Answers1

3

Your expression seems to work fine, both of them. If you wish to capture anything before or after your target output, you can simply add two boundaries in the left and right, which would do that for you. For example, please look at this test:

(.*)(((1[0-2]|0?[1-9])\/(3[01]|[12][0-9]|0?[1-9])\/(?:[0-9]{2})?[0-9]{2})|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}))(.*)

where you can for instance add two groups similar to (.*) and wrap your original expression in one group which will do so.

enter image description here

RegEx Descriptive Graph

The graph visualize how your expression works and you might want to test other expressions in this link:

enter image description here

C# Test

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"(.*)(((1[0-2]|0?[1-9])\/(3[01]|[12][0-9]|0?[1-9])\/(?:[0-9]{2})?[0-9]{2})|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}))(.*)";
        string input = @"Paid on Jan 1, 2019 And anything else that you wish to have after
Paid on 1/1/2019 And anything else that you wish to have after";
        RegexOptions options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

JavaScript Demo

This JavaScript demo shows that your expression works:

const regex = /(.*)(((1[0-2]|0?[1-9])\/(3[01]|[12][0-9]|0?[1-9])\/(?:[0-9]{2})?[0-9]{2})|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}))(.*)/gm;
const str = `Paid on Jan 1, 2019 And anything else that you wish to have after
Paid on 1/1/2019 And anything else that you wish to have after`;
const subst = `\nGroup 1: $1 \nGroup 2: $2 \nGroup 3: $3 \nGroup 4: $4 `;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

Basic Performance Test

This JavaScript snippet returns runtime of a 1-million times for loop for performance.

const repeat = 1000000;
const start = Date.now();

for (var i = repeat; i >= 0; i--) {
 const string = 'Paid on Jan 1, 2019';
 const regex = /(.*)(((1[0-2]|0?[1-9])\/(3[01]|[12][0-9]|0?[1-9])\/(?:[0-9]{2})?[0-9]{2})|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}))(.*)/gm;
 var match = string.replace(regex, "\nGroup #1: $1\nGroup #2: $2 \n");
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match  ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test.  ");

Improvement

You might want to reduce your capturing groups around month names, and you can simply add all of them in one capturing group, if you wish.

Emma
  • 27,428
  • 11
  • 44
  • 69
  • 1
    thank you very much - exactly what I need and the samples and test code are very helpful. – TedS May 10 '19 at 17:48
  • How do you just match return the Date? This returns "adfadf January 1, 2021' – Jon Mar 27 '21 at 20:49