I have text (news) data and want to extract dates from the text. The dates can be in any format, such as April 10 2018, 10-04-2018, 10/04/2018, 2018/04/10, 04.10.2018, etc.
An example string would be:
My Friend is coming on july 10 2018 or 10/07/2018
We can extract the date substrings with str_extract_all and then convert them to Date objects with anydate:
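library(anytime)
library(stringr)
# define the input before it is used
str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"
# pull out every date-like substring, then let anydate guess the format
anydate(str_extract_all(str1, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")[[1]])
#[1] "2018-07-10" "2018-10-07"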
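Note that 10/07/2018 is ambiguous: the output above reads it as month/day (October 7), while the question seems to mean July 10. If the data is known to be day-first, one option is to parse with explicit formats instead of letting the parser guess. The following is only a rough base R sketch under that day-first assumption. The helper name parse_with_formats, the pattern date_pattern, and the list of formats are made up for illustration, and the month-name format (%B) depends on the locale.

```r
# Hypothetical pattern covering three layouts (an assumption, not exhaustive)
date_pattern <- paste(
  "[[:alpha:]]+ [0-9]{1,2} [0-9]{4}",        # month name, day, year
  "[0-9]{1,2}[/.-][0-9]{1,2}[/.-][0-9]{4}",  # day and month first, then year
  "[0-9]{4}[/.-][0-9]{1,2}[/.-][0-9]{1,2}",  # year first
  sep = "|"
)

# Hypothetical helper: try each format in turn and keep the first that parses.
# Day-first formats come first, so 10/07/2018 is read as 10 July 2018.
parse_with_formats <- function(x,
                               formats = c("%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y",
                                           "%Y/%m/%d", "%B %d %Y")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)                        # entries not parsed yet
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"
# extract every substring that matches one of the layouts
candidates <- regmatches(str1, gregexpr(date_pattern, str1))[[1]]
parse_with_formats(candidates)
```

Anything that matches none of the formats comes back as NA, so unparseable candidates are easy to spot and drop.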
parsedate works well for these things.
library(parsedate)
dates = c("April 10 2018", "10-04-2018", "10/04/2018", "2018/04/10", "04.10.2018")
parsedate::parse_date(dates)
[1] "2018-04-10 UTC" "2018-10-04 UTC" "2018-10-04 UTC" "2018-04-10 UTC" "2018-10-04 UTC"
parsedate is a nice package, but it fails on the following string: it returns a date even though the text contains no complete date.
txt = "Live coverage as American payrolls data shows big rise in unemployment, after composite PMI data shows UK business activity sunk to a record low in March following the Covid-19 lockdown"
> parsedate::parse_date(txt)
[1] "2020-03-19 UTC"
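One way to guard against this kind of false positive might be to run a strict pattern over the text first and only hand real matches to the parser. This is a minimal base R sketch, not a complete solution. The function name extract_date_candidates and the pattern are assumptions chosen for illustration, and the pattern only covers layouts with a four-digit year.

```r
# Hypothetical helper: return only substrings that look like a full date
# (month name + day + year, or three numeric parts with a four-digit year).
extract_date_candidates <- function(text) {
  pattern <- paste(
    "[[:alpha:]]+ [0-9]{1,2} [0-9]{4}",
    "[0-9]{1,2}[/.-][0-9]{1,2}[/.-][0-9]{4}",
    "[0-9]{4}[/.-][0-9]{1,2}[/.-][0-9]{1,2}",
    sep = "|"
  )
  regmatches(text, gregexpr(pattern, text))[[1]]
}

txt <- "Live coverage as American payrolls data shows big rise in unemployment, after composite PMI data shows UK business activity sunk to a record low in March following the Covid-19 lockdown"

candidates <- extract_date_candidates(txt)
length(candidates)  # 0: nothing in txt looks like a full date
```

When candidates is empty, the text can be skipped. Otherwise the matches can be passed on to parse_date or anydate as in the answers above.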