I think the best approach would be to know the date format of the strings before reading the file, and then to test that the format is always as expected. However, as the OP states, this is not the case here. Here is a non-exhaustive list of date formats; it should give you an impression of how tedious it can be to work out a regex that only allows valid dates. Also, format guessing can make your scripts somewhat unpredictable for anyone who does not understand in detail how the guessing is done.
If you still think you need a regex for different date formats, try to design it so that it is clear to the reader which format is given priority:
(?:format1)|(?:format2)|...|(?:formatN)
In this case format1 would have priority over the formats that follow it, because the alternatives are tried from left to right.
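To illustrate the priority, here is a minimal sketch in base R. It uses perl = TRUE (PCRE) as an assumption so that it runs without extra packages, and the variable names are only illustrative. The same input gives a different match depending on which alternative is listed first:

```r
# Alternatives are tried from left to right; the first one that matches wins.
x <- "2016"

two_first  <- regmatches(x, regexpr("\\d{2}|\\d{4}", x, perl = TRUE))
four_first <- regmatches(x, regexpr("\\d{4}|\\d{2}", x, perl = TRUE))

print(two_first)   # "20"   (the two-digit branch is tried first)
print(four_first)  # "2016" (the four-digit branch is tried first)
```

So the more specific or preferred format should come first in the chain.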
There are also quite nice regexes at https://stackoverflow.com/a/15504877/6018688 that do some date validity checking, even accounting for leap years, for the formats dd/mm/yyyy, dd-mm-yyyy or dd.mm.yyyy:
^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
And from the same question, a different answer that also handles month names:
^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
I think you now get an impression of how convoluted it can be to write a regex that does exactly what you intend. I would really try to keep the set of allowed dates to a minimum and aim for a quite restrictive regex. In your example, the strings contain only dates (and spaces), nothing else. If this is also the case for your real data, you should try to match the whole string with "^yourregex$"; if you want to allow for spaces at the beginning and end of the string, use "^\s*yourregex\s*$". Since one of your examples has spaces at the beginning of the string, I use the latter for further development.
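As a quick check of the two anchored forms, here is a small sketch in base R. The value of yourregex is a hypothetical stand-in, and perl = TRUE is an assumption made so that the sketch runs without extra packages. The pattern is wrapped in a non-capturing group so that the anchors still apply to the whole pattern if it contains alternatives:

```r
# Hypothetical stand-in for "yourregex": an ISO-like yyyy-mm-dd shape.
yourregex <- "\\d{4}-\\d{2}-\\d{2}"

strict <- paste0("^(?:", yourregex, ")$")          # whole string, no padding
padded <- paste0("^\\s*(?:", yourregex, ")\\s*$")  # leading/trailing whitespace allowed

x <- c("2016-01-02", "  2016-01-02 ", "on 2016-01-02", "2016-01-021")

strict_ok <- grepl(strict, x, perl = TRUE)
padded_ok <- grepl(padded, x, perl = TRUE)

print(strict_ok)  # TRUE FALSE FALSE FALSE
print(padded_ok)  # TRUE  TRUE FALSE FALSE
```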
In your case I would start with only years:
"^\\s*(?:\\d{4})\\s*$"
Then allow numeric dates such as mm-dd-yyyy. This does not check whether the date is actually valid (it would accept "33-13-2016"), and it also allows a two-digit year:
"(?:\\d{1,2}[/.-]\\d{1,2}[/.-](?:\\d{4}|\\d{2}))"
If you also want to allow whitespace around the delimiters (note that this variant only accepts four-digit years):
"(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})"
Then formats with written or abbreviated month names:
"(\\d{1,2}\\s*[/.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4}))"
Put together, with all alternatives wrapped in one group so that the anchors and the surrounding \s* apply to every format:
"^\\s*(?:(?:\\d{4})|(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})|(?:\\d{1,2}\\s*[/.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4})))\\s*$"
This way you can chain as many formats as you wish.
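One way to keep such a chain readable is to build it from named pieces. The following is only a sketch under a few assumptions: it uses base R with perl = TRUE, the built-in constants month.name and month.abb for the English month names, and illustrative variable names (month, sep, formats, whole). All alternatives sit inside one group, so the anchors apply to each format:

```r
# Month names: full names, standard abbreviations, plus two extra spellings.
month <- paste0("(?:",
                paste(c(month.name, month.abb, "Febr", "Sept"), collapse = "|"),
                ")\\.?")                     # optional trailing dot
sep <- "\\s*[/.-]\\s*"                       # delimiter with optional whitespace

formats <- c(
  "\\d{4}",                                                 # year only
  paste0("\\d{1,2}", sep, "\\d{1,2}", sep, "\\d{4}"),       # numeric date
  paste0("\\d{1,2}\\s*[/.-]?\\s*", month,
         "\\s*[/.-]?\\s*(?:\\d{4}|'?\\d{2})")               # date with month name
)

# One group around all alternatives, anchored to the whole string.
whole <- paste0("^\\s*(?:", paste(formats, collapse = "|"), ")\\s*$")

x <- c("1985", " 02/01/2016 ", "2. Jan. 2016", "2 Jan '16",
       "19855", "ID1985A", "x 02/01/2016")
ok <- grepl(whole, x, perl = TRUE)
print(ok)  # TRUE TRUE TRUE TRUE FALSE FALSE FALSE
```

Adding another format then only means adding one more element to formats, and the order of the elements documents their priority.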
Please compare the following regex with yours to check the behavior on different input strings. I added word boundary (\b) constraints; since you used str_extract_all, I assume there can be multiple dates in the same string.
library(stringr)  # provides str_extract_all
string = "only a year 1985. No space 2.Jan.2016. 2. Jan. 2016. 2. Jan. '16 2/1/16 02/01/2016 19855 ID1985A 2. Jan 2016 2.. Jan 2016 1January2016 2-Jan.-2016 2-Jan-2016 2.\tJan.\t2016"
pattern = "(\\d{1,2}[/\\.-][ ]?)?(\\d{1,2}[ ]*[/\\.-]|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)[ ]*[']?\\d{2,4}"
# same idea with \\b word boundaries; the optional dot after an abbreviated month is escaped
p="\\s*(?:\\b\\d{4}\\b)|(?:\\b\\d{1,2}\\s*[/\\.-]\\s*\\d{1,2}\\s*[/\\.-]\\s*(?:\\d{4}|\\d{2})\\b)|\\b\\d{1,2}\\s*[/\\.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|(?:Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec)\\.?)\\s*[/\\.-]?\\s*(?:\\d{4}|'?\\d{2})\\b\\s*"
str_extract_all(string, pattern=pattern)
str_extract_all(string, pattern=p)
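The effect of the word boundaries is easier to see on a shorter input. This is a reduced sketch, not the full pattern from above: it uses base R with perl = TRUE as an assumption, only two simplified formats, and illustrative variable names:

```r
text <- "ID1985A born 1985, moved 02/01/2016 and 19855 items"

without_b <- "\\d{4}|\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{4}"
with_b    <- "\\b\\d{4}\\b|\\b\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{4}\\b"

loose  <- regmatches(text, gregexpr(without_b, text, perl = TRUE))[[1]]
strict <- regmatches(text, gregexpr(with_b, text, perl = TRUE))[[1]]

print(loose)   # "1985" "1985" "02/01/2016" "1985"  (also hits ID1985A and 19855)
print(strict)  # "1985" "02/01/2016"
```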
A word of warning: when you allow multiple variants of different formats with optional spaces, it becomes hard to guarantee that only dates are matched and not some other numeric values in the text.
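One possible way to reduce such false positives is to check each candidate with as.Date after the regex has matched. This is an addition to the regex approach, sketched under the assumptions that the numeric dates are in day-month-year order with a four-digit year; the helper name is_real_date is hypothetical:

```r
# Hypothetical helper: TRUE if x has the d-m-yyyy shape AND is a real calendar date.
is_real_date <- function(x, format = "%d-%m-%Y") {
  shape_ok <- grepl("^\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{4}$", x, perl = TRUE)
  # Normalise the delimiter so that one format string covers "/", "." and "-".
  normalised <- gsub("[/.]", "-", x, perl = TRUE)
  parsed <- as.Date(normalised, format = format)  # NA when parsing fails
  shape_ok & !is.na(parsed)
}

res <- is_real_date(c("02-01-2016", "2/1/2016", "33-13-2016", "2016"))
print(res)  # TRUE TRUE FALSE FALSE
```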
Escaping the dot inside a character class is unnecessary: [\.] can simply be written as [.]. If you also want to allow a backslash as the delimiter (day\month\year), add it to the class explicitly as an escaped backslash.
When the input format is variable, a space can also be a tab (\t), so replacing [ ] with \s seems to be a good idea. Note that \s matches any whitespace character, including line terminators like \n.
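The difference shows up as soon as a tab separates the parts. A minimal sketch in base R (perl = TRUE and the variable names are assumptions for illustration):

```r
x <- c("2. Jan 2016", "2.\tJan\t2016")  # second string uses tabs

space_only <- grepl("^\\d{1,2}\\.[ ]*Jan[ ]*\\d{4}$", x, perl = TRUE)
any_ws     <- grepl("^\\d{1,2}\\.\\s*Jan\\s*\\d{4}$", x, perl = TRUE)

print(space_only)  # TRUE FALSE
print(any_ws)      # TRUE  TRUE
```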