Suppose there is a file with a multi-line string like:

/Analysis made on 28 september 2011 people who exercise a lot are healthy/

How can I extract the date 28 September 2011 from the entire file or string, regardless of which month appears in the date or how it is capitalized?

DHW
  • 1
    Hi Shabbu. Please describe your problem a bit better. Are all strings like that you have to substring? Because if it is only for this sentence then the solution is easy and you should do a stringsplit. – Ansjovis86 Sep 29 '19 at 14:22
  • Extract the substring, i.e. only the date "28 september 2011", from a multiline string or file – Shabbu Pathan Sep 29 '19 at 14:32
  • 1
    `stringr::str_extract(your_string, paste('(?i)\\d+',month.name,'\\d+',collapse='|'))`?? – Onyambu Sep 29 '19 at 14:48
  • 1
    Possible duplicate of [Extract date from given string in r](https://stackoverflow.com/questions/43405615/extract-date-from-given-string-in-r) – Vitali Avagyan Sep 29 '19 at 14:58

1 Answer

I assume you have more than one date to extract, and that you want the results as Date objects. If you would rather have text, pass the parsed dates to format() with the strptime() specification you want, e.g. %e %B %Y. Converting to dates first standardizes them, which matters here because one of the month names is lowercase.
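The format() step mentioned above can be sketched roughly as follows. The vector `parsed` and its two values are hypothetical stand-ins for dates that have already been extracted and converted, and the month names that come out depend on the current locale (English is assumed here).

```r
# Hypothetical example values: two dates that have already been parsed.
parsed <- as.Date(c("2011-09-28", "1998-05-06"))

# %e is the day of the month padded with a space, %B the full month name
# in the current locale, %Y the four-digit year.
as_text <- format(parsed, "%e %B %Y")

# trimws() drops the padding that %e puts in front of single-digit days.
as_text <- trimws(as_text)
print(as_text)
```

In an English locale this should print "28 September 2011" and "6 May 1998", so the lowercase "september" from the input comes back capitalized.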

Here I use R's built-in month.name vector of full month names to build a single regex that matches any month name preceded by a one- or two-digit day and followed by a four-digit year. str_extract_all() returns a list of character vectors, one per document string, holding the extracted date strings in the order they appear. I then map as_date() over them with the matching parse format so that they become actual R dates.

  library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
  library(tidyverse)

  string <- 
    "Analysis made on 28 september 2011 people who exercise a lot are healthy
Another analysis on 6 May 1998 found otherwise"

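  # Build one alternative per month ("<day> <Month> <year>") and join them with "|";
  # regex(ignore_case = TRUE) lets the month names match in any capitalization.
  pattern <-
    paste("[:digit:]{1,2}", month.name, "[:digit:]{4}",
          collapse = "|") %>%
    regex(ignore_case = TRUE)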

  pattern
#> [1] "[:digit:]{1,2} January [:digit:]{4}|[:digit:]{1,2} February [:digit:]{4}|[:digit:]{1,2} March [:digit:]{4}|[:digit:]{1,2} April [:digit:]{4}|[:digit:]{1,2} May [:digit:]{4}|[:digit:]{1,2} June [:digit:]{4}|[:digit:]{1,2} July [:digit:]{4}|[:digit:]{1,2} August [:digit:]{4}|[:digit:]{1,2} September [:digit:]{4}|[:digit:]{1,2} October [:digit:]{4}|[:digit:]{1,2} November [:digit:]{4}|[:digit:]{1,2} December [:digit:]{4}"
#> attr(,"options")
#> attr(,"options")$case_insensitive
#> [1] TRUE
#> 
#> attr(,"options")$comments
#> [1] FALSE
#> 
#> attr(,"options")$dotall
#> [1] FALSE
#> 
#> attr(,"options")$multiline
#> [1] FALSE
#> 
#> attr(,"class")
#> [1] "regex"     "pattern"   "character"

  # Extract every match per input string, then parse each match as a date.
  str_extract_all(string, pattern) %>%
    map(as_date, tz = "", format = "%e %B %Y")
#> [[1]]
#> [1] "2011-09-28" "1998-05-06"

Created on 2019-09-29 by the reprex package (v0.3.0)
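
If you would rather not load the tidyverse, the same idea can probably be sketched in base R. This is an alternative, not the code from the reprex above. The names `text`, `months_alt`, `date_regex`, `hits` and `dates` are only illustrative. The `\\b` word boundaries are an addition meant to avoid matching inside longer runs of digits, and the month names are assumed to parse in an English locale.

```r
# Base-R sketch of the same approach; no packages required.
text <- "Analysis made on 28 september 2011 people who exercise a lot are healthy
Another analysis on 6 May 1998 found otherwise"

# Join the twelve month names into one alternation: "January|February|...".
months_alt <- paste(month.name, collapse = "|")

# A 1-2 digit day, a month name, and a 4-digit year, delimited by word boundaries.
date_regex <- paste0("\\b[0-9]{1,2} (", months_alt, ") [0-9]{4}\\b")

# gregexpr() locates every match; regmatches() returns the matched substrings.
# ignore.case = TRUE handles "september" as well as "September".
hits <- regmatches(
  text,
  gregexpr(date_regex, text, ignore.case = TRUE, perl = TRUE)
)[[1]]

# Parse the extracted strings; %B reads month names in the current locale.
dates <- as.Date(hits, format = "%d %B %Y")
print(dates)
```

Because `text` is a single string, `[[1]]` picks out its matches. For a character vector of several documents, drop the `[[1]]` and apply `as.Date()` to each list element with `lapply()`.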

DHW