I am trying to create a regular expression in R that will search for dates within some text. Since I cannot control the actual date format, I am trying to "catch" all the possible dd/mm/yy formats (one or two digit months, two or four digit years, optional 1 or two digit days, with a range of separators ("/", "-", "."), possibly containing spaces).

My regular expression so far is:

pattern = "(\\d{0,2}[/\\.-])?[ ]?(\\d{1,2}[ ]*[/\\.-]|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)[ ]*[']?\\d{2,4}"

This seems to work on most formats, but it contains a bug that I find hard to understand:

str_extract_all("09/11 /1985", pattern = pattern) # returns: "09/11 /1985"
str_extract_all(" 09/11 /1985", pattern = pattern) # returns: c("09/11",  "1985")

This sounds extremely weird. Since I am not including lookarounds, the extra space in the start should make no difference. The results say otherwise. What am I doing wrong?

  • Thanks for your suggestion. I did not know about the package. I just installed it and it seems to be very flexible. Do you know if I can use it somehow to search for dates within my text? – GerasimosPanagiotakopoulos Jul 20 '16 at 13:43
  • No, I don't think you can. The package just helps parsing dates stored in different formats. – RHertel Jul 20 '16 at 13:48
  • add some example strings and expected results. wouldn't `pattern <- '\\d+[ /.-]+\\w+[ /.-]+\\d+'` be good enough? – rawr Jul 20 '16 at 13:49
  • In your second case, the leading space is matched by the optional space of your day pattern, then `09` is matched as the month and `11` as the year – Sebastian Proske Jul 20 '16 at 13:55
  • I am actually trying to read some CVs. It is impossible for me to predict what format the writers will use, however I am trying to include all the "reasonable" combinations that I might encounter. The pattern you suggest will not yield hits for simple years (2000) or months and years (05/95 or 8/2004) – GerasimosPanagiotakopoulos Jul 20 '16 at 14:00
  • @Sebastian Proske thanks for the insightful comment. This is actually what my problem is. However, I still need to detect a date like "05 /03/2012". Do you have any ideas how I can fix that? – GerasimosPanagiotakopoulos Jul 20 '16 at 14:02
  • If you just extract the matched string it's obvious that you'll get the same format of the input. To change the matched string you need a replace function. – logi-kal Jul 20 '16 at 14:04
  • @horcrux thanks for your reply. I am not complaining about the format of my extraction, I am very happy with "09/11 /1985". It is the c("09/11", "1985") result (two dates instead of one) that annoys me. – GerasimosPanagiotakopoulos Jul 20 '16 at 14:11
  • Oh, that seems to have been already solved by @SebastianProske. Try just adding ` *` (space and star) at the beginning (and maybe at the end?) of your regex. – logi-kal Jul 20 '16 at 14:17
  • See `guess_formats` in the lubridate package. – G. Grothendieck Jul 20 '16 at 14:42

2 Answers

The problem lies in the first part of your regex, where you probably try to match the days: (\\d{0,2}[/\\.-])?[ ]? It optionally matches 0 to 2 digits followed by one of your delimiters. Then it optionally matches a space.

In the case of " 09/11 /1985" (with the leading space), this part matches only the leading space, leaving 09 to be matched as month and 11 as year.
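A minimal sketch of this behaviour, isolating the first part of the pattern and anchoring it with ^ purely for illustration. It uses base R's regexpr/regmatches with perl = TRUE instead of stringr (an assumption, to keep the example free of dependencies), and the helper name consumed is made up.

```r
# First part of the original pattern, anchored at the start for illustration only
prefix <- "^(\\d{0,2}[/.-])?[ ]?"

# Hypothetical helper: return the text that `pattern` consumes in `x`
consumed <- function(x, pattern) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

no_space   <- consumed("09/11 /1985", prefix)   # optional group takes the day and delimiter
with_space <- consumed(" 09/11 /1985", prefix)  # group is skipped, only the space is taken

print(no_space)    # "09/"
print(with_space)  # " "
```

With the leading space, the day slot is already used up before any digit is seen, so the following digits can only fill the month and year slots.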

To get rid of this behaviour, you should move the space into the optional group. You might also want to match 1 or 2 digits, otherwise it will match leading delimiters.

So I would rewrite this first part to (\\d{1,2}[/\\.-][ ]?)?
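As a rough check of this rewrite, the sketch below plugs the new first part into the pattern from the question and runs it on both inputs. It uses base R (gregexpr/regmatches with perl = TRUE) rather than str_extract_all, which is an assumption made to avoid a package dependency; extract_all is a made-up helper name.

```r
# Month alternation copied from the question
months <- paste0(
  "January|February|March|April|May|June|July|August|September|October|",
  "November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|",
  "Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|",
  "Oct\\.|Nov\\.|Dec\\."
)

# Same pattern as in the question, only the first part is replaced
fixed <- paste0(
  "(\\d{1,2}[/\\.-][ ]?)?",
  "(\\d{1,2}[ ]*[/\\.-]|", months, ")",
  "[ ]*[']?\\d{2,4}"
)

# Hypothetical helper standing in for str_extract_all on a single string
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

r1 <- extract_all("09/11 /1985", fixed)
r2 <- extract_all(" 09/11 /1985", fixed)
print(r1)  # "09/11 /1985"
print(r2)  # "09/11 /1985"
```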

There are a few other points you could improve, e.g.:

  • January|Jan|Jan\\. is the same as Jan(?:\\.|uary)?
  • consider using non capturing groups
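Both points can be tried out with a small sketch in base R (perl = TRUE is an assumption here; the variable names are illustrative). The first half compares the spelled-out alternation with the compact form on a few inputs, the second half shows what a capturing group adds to the result compared with a non-capturing one.

```r
# Anchored so that only whole-string matches count in the comparison
spelled_out <- "^(?:January|Jan|Jan\\.)$"
compact     <- "^Jan(?:\\.|uary)?$"

samples <- c("January", "Jan", "Jan.", "Janu", "Feb")
a <- grepl(spelled_out, samples, perl = TRUE)
b <- grepl(compact, samples, perl = TRUE)
print(identical(a, b))  # TRUE: both forms accept the same inputs here

# regexec reports the full match plus one entry per capturing group
# (perl = TRUE for regexec needs R >= 3.3.0)
capturing <- regmatches("09/11", regexec("(\\d{2})/(\\d{2})", "09/11", perl = TRUE))[[1]]
non_capturing <- regmatches("09/11", regexec("(?:\\d{2})/(?:\\d{2})", "09/11", perl = TRUE))[[1]]

print(capturing)      # "09/11" "09" "11"
print(non_capturing)  # "09/11"
```

Both kinds of parentheses group in the "math" sense; the capturing kind additionally stores what it matched.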
Sebastian Proske
  • I am interested in the "non capturing groups part of your comment". After some testing, I reached the conclusion that the () does not capture anything, but it works as it works in math. Have I got something wrong? – GerasimosPanagiotakopoulos Jul 20 '16 at 14:21
  • You might get some insight [here](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group). – Sebastian Proske Jul 20 '16 at 14:24

I think the best thing would be to know the date format used in the given string prior to reading the file and then test whether the date format is always as expected. However, as the OP states, this is not the case. Even a non-exhaustive list of date formats should give you an impression that it can be tedious work to figure out a regex that only allows valid dates. Also, format guessing can make your scripts somewhat unpredictable for someone who does not understand in detail how the guessing is done.

If you still think you need to use regex for different date formats try to design it in a way that makes it clear to the reader which one format is given priority:

(?:format1)|(?:format2)|...|(?:formatN)

In this case format1 would have priority over format2, and so on down the list.
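A small sketch of why the order matters, assuming a PCRE-style engine (base R with perl = TRUE), where the first alternative that matches at a given position wins. The two formats and the helper extract_all are made up for illustration.

```r
# Hypothetical helper standing in for str_extract_all on a single string
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Two formats: month/four-digit year and month/two-digit year
four_first <- "(?:\\d{1,2}/\\d{4})|(?:\\d{1,2}/\\d{2})"
two_first  <- "(?:\\d{1,2}/\\d{2})|(?:\\d{1,2}/\\d{4})"

x <- "05/1995"
r_four <- extract_all(x, four_first)
r_two  <- extract_all(x, two_first)

print(r_four)  # "05/1995"
print(r_two)   # "05/19": the two-digit format wins and cuts the year short
```

Putting the more specific (longer) format first avoids the truncated match.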

There are also quite nice regexes at https://stackoverflow.com/a/15504877/6018688 that check date validity, even accounting for leap years, for the formats dd/mm/yyyy, dd-mm-yyyy and dd.mm.yyyy:

^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

and from the same Question, a different answer with month names:

^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

I think you now get an impression of how convoluted it can be to write a regex that does exactly what you intend. I would really try to keep the allowed dates to a minimum and aim for a quite restrictive regex. In your example, you give strings containing only dates (and spaces), nothing else. If this is also the case for your real data, you should try to match the whole string with "^yourregex$"; if you want to allow for spaces at the beginning and end of the string, use "^\s*yourregex\s*$". Since you have one example with spaces at the beginning of the string, I use the latter for further development.

In your case I would start with only years:

"^\\s*(?:\\d{4})\\s*$"
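A quick sketch of what the anchored year-only pattern accepts and rejects, using base R's grepl with perl = TRUE (an assumption; the sample strings are made up).

```r
year_only <- "^\\s*(?:\\d{4})\\s*$"

samples <- c("1985", "  1985 ", "19855", "ID1985A", "in 1985")
hits <- grepl(year_only, samples, perl = TRUE)

# Only strings consisting of exactly four digits plus optional surrounding
# whitespace are accepted
print(setNames(hits, samples))
```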

Then allow numeric dates such as dd-mm-yyyy (this does not check whether the date is valid, so "33-13-2016" is accepted; a two-digit year is also allowed):

"(?:\\d{1,2}[/.-]\\d{1,2}[/.-](?:\\d{4}|\\d{2}))"

and if you want to allow space between the delimiters:

"(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})"
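A minimal sketch of this pattern applied to the inputs discussed in the question and comments, using base R with perl = TRUE (an assumption; extract_all is a made-up helper).

```r
numeric_date <- "(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})"

# Hypothetical helper standing in for str_extract_all on a single string
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

r1 <- extract_all(" 09/11 /1985 and 33-13-2016", numeric_date)
r2 <- extract_all("05 /03/2012", numeric_date)

print(r1)  # "09/11 /1985" "33-13-2016" (no validity check, as noted above)
print(r2)  # "05 /03/2012"
```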

Then formats with written or abbreviated month names:

"(\\d{1,2}\\s*[/.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4}))"

Put together, with all alternatives wrapped in one group so that ^ and $ apply to each of them and not only to the first and last:

"^\\s*(?:(?:\\d{4})|(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})|(?:\\d{1,2}\\s*[/.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4})))\\s*$"

This way you can chain as many formats as you wish.
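One way to keep such a chain maintainable is to store each format under a name and assemble the final pattern with paste0. The sketch below is only an illustration under some assumptions: it uses base R with perl = TRUE, builds the month alternation from R's built-in month.name and month.abb (so spellings such as "Sept" or "Febr" would have to be added by hand), and all variable names are made up.

```r
# Month names and abbreviations, each optionally followed by a dot
months <- paste0("(?:", paste(c(month.name, month.abb), collapse = "|"), ")\\.?")

# One entry per format; earlier entries take priority over later ones
formats <- c(
  year    = "\\d{4}",
  numeric = "\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4}",
  named   = paste0("\\d{1,2}\\s*[/.-]?\\s*", months, "\\s*[/.-]?\\s*(?:\\d{4}|'?\\d{2})")
)

# Wrap every format in its own group, then wrap the whole alternation so the
# anchors apply to all formats
combined <- paste0("^\\s*(?:", paste0("(?:", formats, ")", collapse = "|"), ")\\s*$")

samples <- c("1985", " 09/11 /1985", "2. Jan. 2016", "2 January '16", "abc 1/2/2016")
hits <- grepl(combined, samples, perl = TRUE)
print(setNames(hits, samples))
```

Adding another format then means adding one more named entry to formats.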

Please compare the following regex with yours to check the behavior on different input strings. I added word boundary \b constraints: since you used str_extract_all, I assume there can be multiple dates in the same string.

library(stringr)  # provides str_extract_all

# test string with several date spellings and some non-date numbers
string = "only a year 1985. No space 2.Jan.2016. 2. Jan. 2016. 2. Jan. '16 2/1/16 02/01/2016 19855 ID1985A 2. Jan 2016   2.. Jan 2016 1January2016 2-Jan.-2016 2-Jan-2016 2.\tJan.\t2016"
# pattern from the question, with the first part fixed
pattern = "(\\d{1,2}[/\\.-][ ]?)?(\\d{1,2}[ ]*[/\\.-]|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)[ ]*[']?\\d{2,4}"
# chained formats with word boundaries; the dot after an abbreviation is escaped
p="\\s*(?:\\b\\d{4}\\b)|(?:\\b\\d{1,2}\\s*[/\\.-]\\s*\\d{1,2}\\s*[/\\.-]\\s*(?:\\d{4}|\\d{2})\\b)|\\b\\d{1,2}\\s*[/\\.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|(?:Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec)\\.?)\\s*[/\\.-]?\\s*(?:\\d{4}|'?\\d{2})\\b\\s*"
str_extract_all(string, pattern=pattern)
str_extract_all(string, pattern=p)
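The effect of the \b constraints can be seen in isolation with a year-only pattern. This is a sketch in base R with perl = TRUE (an assumption, to avoid the stringr dependency); the test string is a shortened, made-up variant of the one above.

```r
# Hypothetical helper standing in for str_extract_all on a single string
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

x <- "only a year 1985. 19855 ID1985A 2016"

bounded   <- extract_all(x, "\\b\\d{4}\\b")
unbounded <- extract_all(x, "\\d{4}")

print(bounded)    # "1985" "2016"
print(unbounded)  # "1985" "1985" "1985" "2016": also hits inside 19855 and ID1985A
```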

A word of warning: When allowing multiple versions of different formats with spaces, you allow for variances that make it hard to guarantee that only dates are matched and not some other numeric values in the text.

Escaping the dot inside a character class is unnecessary: [\.] can simply be written [.]. A backslash inside the class is only needed if you also want to allow a literal backslash as delimiter (day\month\year), and then it has to be escaped itself. When the input format is variable, a space can also be a tab \t, so replacing [ ] with \s (which matches any whitespace character, including tabs and line terminators like \n) seems to be a good idea.
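Both remarks can be checked with a short sketch in base R (perl = TRUE is an assumption; the sample strings and variable names are made up).

```r
# Hypothetical helper standing in for str_extract_all on a single string
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Inside a character class the dot is literal, escaped or not
x <- "2.Jan-2016/a"
plain   <- extract_all(x, "[/.-]")
escaped <- extract_all(x, "[/\\.-]")
print(plain)    # "." "-" "/"
print(escaped)  # "." "-" "/"

# [ ] only matches a space, \s also matches a tab or a newline
tabbed <- "2.\tJan.\t2016"
space_only <- grepl("2\\.[ ]Jan", tabbed, perl = TRUE)
any_space  <- grepl("2\\.\\sJan", tabbed, perl = TRUE)
newline    <- grepl("\\s", "\n", perl = TRUE)
print(c(space_only, any_space, newline))  # FALSE TRUE TRUE
```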

fabianegli
  • Thank you very much for your reply. I actually like your approach very much. However, I am a little confused about how I should implement it. Since the possible date combinations are A LOT, it's quite tedious to implement a format for each one. That's why I tried to use a "single" regex, to benefit from the regex syntax. I do admit though that your idea is considerably easier to understand and maintain, and maybe at the end of the day, that is all that matters. – GerasimosPanagiotakopoulos Jul 20 '16 at 15:35
  • x = "abc.def" str_extract_all(x, ".") # result: "a" "b" "c" "." "d" "e" "f" str_extract_all(x, "\\.") # result: "." I don't see how escaping the dot is unnecessary – GerasimosPanagiotakopoulos Jul 20 '16 at 15:37
  • It is only unnecessary in character groups, as in [/.-]; the dots that are not in a character group need to be escaped the way you did. – fabianegli Jul 20 '16 at 15:41
  • @GerasimosPanagiotakopoulos you could for example use one format to match all of type yyyy, one for mm/yyyy and for dd/mm/yyyy, one for those with months as in Jan. and one for epoch time. – fabianegli Jul 20 '16 at 15:47
  • Thank you very much for your update. I really appreciate your advice. I will definitely consider changing my regex to something that can be maintained more easily. I am afraid I can't control the input though. Since my goal is to process human - written text (no prompts allowed), it is impossible for me to control the input, I am just trying to do as well as possible. – GerasimosPanagiotakopoulos Jul 21 '16 at 09:00