I assume you have more than one date to extract here, and that you want the results as date types. If not, just pass them to format() with the strptime() specification you want, e.g. %e %B %Y. Converting to dates first will standardize them, though, because, for example, you have a lowercase month name here.
What I'm doing here is using R's built-in month.name vector of full month names to build a single regex string that matches any month name preceded by a one- or two-digit day and followed by a four-digit year. str_extract_all() then returns a list of character vectors, one per document string, holding all the date strings found in that string, in order. Finally I map as_date() over that list with the appropriate parse pattern so that they're actually R dates.
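To make the pattern-building step concrete, here is a minimal sketch using only the first two month names. The name small_pattern is just for illustration; the point is that paste() recycles the fixed digit patterns across the month vector and collapse = "|" joins the pieces into one alternation.

```r
# Illustration only: two months instead of all twelve
small_pattern <- paste("[:digit:]{1,2}", month.name[1:2], "[:digit:]{4}",
                       collapse = "|")
small_pattern
#> [1] "[:digit:]{1,2} January [:digit:]{4}|[:digit:]{1,2} February [:digit:]{4}"
```

The default sep = " " is what puts the single spaces between day, month and year, so this only matches dates written with exactly one space between the parts.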
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(tidyverse)
string <-
"Analysis made on 28 september 2011 people who exercise a lot are healthy
Another analysis on 6 May 1998 found otherwise"
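# paste() recycles the two digit patterns across all 12 month names,
# then collapse = "|" joins the alternatives into one regex string
pattern <-
  paste("[:digit:]{1,2}", month.name, "[:digit:]{4}",
        collapse = "|") %>%
  regex(ignore_case = TRUE)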
pattern
#> [1] "[:digit:]{1,2} January [:digit:]{4}|[:digit:]{1,2} February [:digit:]{4}|[:digit:]{1,2} March [:digit:]{4}|[:digit:]{1,2} April [:digit:]{4}|[:digit:]{1,2} May [:digit:]{4}|[:digit:]{1,2} June [:digit:]{4}|[:digit:]{1,2} July [:digit:]{4}|[:digit:]{1,2} August [:digit:]{4}|[:digit:]{1,2} September [:digit:]{4}|[:digit:]{1,2} October [:digit:]{4}|[:digit:]{1,2} November [:digit:]{4}|[:digit:]{1,2} December [:digit:]{4}"
#> attr(,"options")
#> attr(,"options")$case_insensitive
#> [1] TRUE
#>
#> attr(,"options")$comments
#> [1] FALSE
#>
#> attr(,"options")$dotall
#> [1] FALSE
#>
#> attr(,"options")$multiline
#> [1] FALSE
#>
#> attr(,"class")
#> [1] "regex" "pattern" "character"
# Extract every match, then parse each character vector into Dates
str_extract_all(string, pattern) %>%
  map(as_date, tz = "", format = "%e %B %Y")
#> [[1]]
#> [1] "2011-09-28" "1998-05-06"
Created on 2019-09-29 by the reprex package (v0.3.0)
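If you would rather end up with standardized strings than Date objects, something like the following should work. It is a sketch that starts from the two dates in the output above, copied in by hand; dates and standardized are names I made up for the example.

```r
# Dates copied from the output above, so this runs on its own
dates <- as.Date(c("2011-09-28", "1998-05-06"))

# %e pads single-digit days with a leading space, so trim it off;
# %B gives the full month name in the current locale
standardized <- trimws(format(dates, "%e %B %Y"))
standardized
```

In an English locale this gives "28 September 2011" and "6 May 1998", so the lowercase "september" from the input comes back capitalized. Month names follow the session locale, so other locales will print them differently.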