I am trying to extract only the date part from a bunch of unstructured text.

The issue is that the date could be in any of the following formats:

  • Jan. 16 or Jan 16 2017 (for January 16th, 2017)
  • Januray 2, 2017
  • 02/01/2017 (dd/mm/yyyy)
  • 01/02/2017 (mm/dd/yyyy)
  • 01-02-17 (mm-dd-yy)

Sample Text:

x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"

What I tried is one of the options from the examples in this answer:

gsub(".*[(]|[)].*", "", string)

Is there a more general approach?

Madhu Sareen
  • Is 'Januray' a typo or is it actually a scenario you need to account for? – KenHBS Sep 27 '17 at 16:23
  • Maybe [THIS](https://regex101.com/r/zYZAYI/1) can be a start. – Gurmanjot Singh Sep 27 '17 at 17:11
  • I'm not sure why you're trying to match parentheses when your sample text doesn't contain any. Secondly, escaping special characters with square brackets is generally bad practice because that's also how you create character sets, and some special characters don't get escaped that way or take on different meanings (`^` for example). Just use `\\`. – CAustin Sep 27 '17 at 17:55
  • @KenS. Thanks for pointing that out. Yes, it is a typo. – Madhu Sareen Sep 27 '17 at 18:26
  • @CAustin - actually this is just a SAMPLE TEXT from a large document which also has parentheses. – Madhu Sareen Sep 27 '17 at 18:27
  • I understand that, but your sample text has nothing to do with the regex pattern you posted. What's the point of a sample that provides no relevant information? – CAustin Sep 27 '17 at 18:39

1 Answer

First of all, without knowing the date format, you cannot tell which part is the day and which is the month. Take 02/03/2002, for instance: it could be 2 March or 3 February. If the year can also have two digits (e.g. dd/mm/yy, yy/mm/dd or mm/yy/dd), you cannot even say which part is the year.

On top of that, some strings may come from a third party, with no way for you to determine their format. So no solution can guarantee to identify the day, month and year for you.

It is possible, however, to match all the patterns you mentioned. The following solution gives you three capture groups: for every format you listed, the parts of the date end up in groups 1, 2 and 3. You will then have to analyze (or guess) which part is the day, which is the month and which is the year. Regex cannot do that for you.

With all that in mind, you may try the following regex:

((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\.?)|(?:\d{1,2}))[\/ ,-](\d{1,2})(?:[\/ ,-]\s*(\d{4}|\d{2}))?
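
Read piece by piece, the pattern can be assembled as in the sketch below. It uses base R (`gregexpr` and `regmatches` with `perl = TRUE`) instead of stringr, which is a choice made for this illustration only, and it runs on the sample text from the question. It returns the full matches, not the three groups.

```r
# The same pattern, split into commented pieces and glued back together.
patt <- paste0(
  "(?i)",                              # case-insensitive matching
  "((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?)",
                                       # group 1: month name, optional dot ...
  "|(?:\\d{1,2}))",                    # ... or a 1-2 digit number
  "[\\/ ,-]",                          # one separator: slash, space, comma or hyphen
  "(\\d{1,2})",                        # group 2: a 1-2 digit number
  "(?:[\\/ ,-]\\s*(\\d{4}|\\d{2}))?"   # group 3 (optional): 4- or 2-digit year
)

x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"

# perl = TRUE selects the PCRE engine; [[1]] takes the matches for the single input string
dates <- regmatches(x, gregexpr(patt, x, perl = TRUE))[[1]]
print(dates)
# "Jan 2, 2017" "02/01/2017" "01/02/17" "Jan. 16" "01-02-2017"
```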

Regex 101 Demo

Sample source (run here):

library(stringr)

# Sample text covering every format from the question
str <- "Jan. 16  bla bla bla Jan 16 2017 bla bla bla January 2, 2017 bla bla bla 02/01/2017 bla bla bla 01/02/2017 bla bla bla 01-02-17 bla bla bla jan. 16 There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"
# Groups: 1 = month name or first number, 2 = second number, 3 = optional year
patt <- "(?i)((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?)|(?:\\d{1,2}))[\\/ ,-](\\d{1,2})(?:[\\/ ,-]\\s*(\\d{4}|\\d{2}))?"
# One row per match: full match in column 1, capture groups in columns 2 to 4
result <- str_match_all(str, patt)
result

Sample Output:

      [,1]              [,2]      [,3] [,4]  
 [1,] "Jan. 16"         "Jan."    "16" ""    
 [2,] "Jan 16 2017"     "Jan"     "16" "2017"
 [3,] "January 2, 2017" "January" "2"  "2017"
 [4,] "02/01/2017"      "02"      "01" "2017"
 [5,] "01/02/2017"      "01"      "02" "2017"
 [6,] "01-02-17"        "01"      "02" "17"  
 [7,] "jan. 16"         "jan."    "16" ""    
 [8,] "Jan 2, 2017"     "Jan"     "2"  "2017"
 [9,] "02/01/2017"      "02"      "01" "2017"
[10,] "01/02/17"        "01"      "02" "17"  
[11,] "Jan. 16"         "Jan."    "16" ""    
[12,] "01-02-2017"      "01"      "02" "2017"
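
As for putting logic on top of the three parts, here is a minimal sketch under assumptions you would have to confirm for your own data. `guess_date` is a hypothetical helper, and its rules are illustrative guesses, not something the regex or R can infer: numeric dates are read month-first unless `day_first = TRUE` or the first number exceeds 12, two-digit years are read as 20yy, and a missing year is replaced by `default_year`.

```r
# Hypothetical helper: turn the three captured parts into a Date.
# All disambiguation rules here are assumptions, not facts about the data.
guess_date <- function(p1, p2, p3, day_first = FALSE, default_year = 2017) {
  # Compare the first three letters against R's built-in month abbreviations
  month_idx <- match(tolower(substr(p1, 1, 3)), tolower(month.abb))
  if (!is.na(month_idx)) {
    # Textual month: the second part can only be the day
    month <- month_idx
    day <- as.integer(p2)
  } else {
    a <- as.integer(p1)
    b <- as.integer(p2)
    if (a > 12 || (day_first && b <= 12)) {
      # a cannot be a month, or the caller asked for day-first
      day <- a
      month <- b
    } else {
      month <- a
      day <- b
    }
  }
  # An unmatched year group may arrive as "" or NA
  year <- if (is.na(p3) || p3 == "") default_year else as.integer(p3)
  if (year < 100) year <- 2000 + year  # assumed century for two-digit years
  # An explicit format makes impossible dates come back as NA instead of an error
  as.Date(sprintf("%04d-%02d-%02d", as.integer(year), as.integer(month), as.integer(day)),
          format = "%Y-%m-%d")
}

guess_date("Jan.", "16", "")                      # 2017-01-16 (year from default_year)
guess_date("02", "01", "2017")                    # 2017-02-01 (month-first guess)
guess_date("02", "01", "2017", day_first = TRUE)  # 2017-01-02
guess_date("25", "12", "2017")                    # 2017-12-25 (25 cannot be a month)
```

The `day_first` switch is exactly the piece of information the text itself cannot give you; it has to come from whoever produced the data.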
Mustofa Rizwan
  • Rizwan, thanks. But how does it matter? For example, 2nd Feb 2002 can be written either way, 02/02/2002 (dd/mm/yyyy or mm/dd/yyyy) or 02/02/02 (dd/mm/yy or yy/dd/mm or mm/yy/dd); it's all the same, right? And some logic should be able to match it and bring it into an appropriate common final format, right? – Madhu Sareen Sep 28 '17 at 06:28
  • What about 02/03/2002 then? Is it the 2nd of March or the 3rd of February? – Mustofa Rizwan Sep 28 '17 at 06:35
  • The point is that with the above regex you will get the 3 parts for sure. It is up to you to decide what logic to put on top of it to determine the day, month and year. – Mustofa Rizwan Sep 28 '17 at 06:42
  • I agree, but there should be some way out? :-( – Madhu Sareen Sep 28 '17 at 07:40
  • Let's forget regex and everything else and think simply: if I tell you 02/03/01 but don't tell you the date format, what do you get from it? There is a way out: the party sending you the data **may** know the format, and they have got to share it. – Mustofa Rizwan Sep 28 '17 at 07:49
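
To make that last point concrete: the same string parses to three different dates depending on the format you are told. A small sketch with base R's `as.Date`; the three format strings are only example candidates, and the variable names are made up for the illustration.

```r
s <- "02/03/01"

# Same text, three candidate formats, three different dates
as_dmy <- as.Date(s, format = "%d/%m/%y")  # 2001-03-02
as_mdy <- as.Date(s, format = "%m/%d/%y")  # 2001-02-03
as_ymd <- as.Date(s, format = "%y/%m/%d")  # 2002-03-01

print(c(dmy = as_dmy, mdy = as_mdy, ymd = as_ymd))
```

Once the sender confirms the format, a single `as.Date` call with that format string is all the parsing logic you need.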