I have text (news) data and want to extract dates from it. The dates can appear in any format, such as April 10 2018, 10-04-2018, 10/04/2018, 2018/04/10, 04.10.2018, etc.

An example string would be:

My Friend is coming on july 10 2018 or 10/07/2018

rachit
  • What have you tried so far? – patL May 03 '18 at 11:55
  • There is no miracle solution: you need to list all the formats that can appear in your text and tackle each one. – byouness May 03 '18 at 11:57
  • Try starting with the regexes from [here](https://www.regular-expressions.info/dates.html). If you get stuck, post where you get stuck. – phiver May 03 '18 at 12:02
  • Have a look at the `anytime` package. The `anydate` function might be useful – Mike H. May 03 '18 at 12:09
  • Note that if you don't know the format, many cases will be ambiguous (is 3/4/18 April 3rd or March 4th?). – iod May 03 '18 at 13:02
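
To make the ambiguity from the last comment concrete, here is a minimal base-R sketch. It assumes nothing beyond `as.Date` with an explicit `format`; the variable names are only illustrative. The same string yields two different dates depending on which convention is assumed.

```r
# The same string parsed under two conventions (base R only).
x <- "3/4/18"

day_first   <- as.Date(x, format = "%d/%m/%y")  # read as day/month/year
month_first <- as.Date(x, format = "%m/%d/%y")  # read as month/day/year

format(day_first)    # "2018-04-03"
format(month_first)  # "2018-03-04"
```

Without knowing the source's convention, neither reading can be ruled out, so any automatic parser has to pick one.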

3 Answers

We can extract the date strings with `str_extract_all` and then convert them to Date class with `anydate`.

library(anytime)
library(stringr)
anydate(str_extract_all(str1, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")[[1]])
#[1] "2018-07-10" "2018-10-07"

data

str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"
akrun
  • @rachit The pattern matches one or more alphanumeric characters (`[[:alnum:]]+`), followed by zero or more spaces or forward slashes (`[ /]*`), then two digits, again zero or more spaces or slashes, and finally four digits. This matches `july 10 2018` or `10/07/2018`, which `anydate` then converts to Date class. – akrun May 05 '18 at 06:08
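
As a rough illustration of the pattern described in the comment above, the following base-R sketch shows which substrings it picks out before any date conversion. It swaps `str_extract_all` for `gregexpr`/`regmatches`, and `extract_date_strings` with its `seps` argument is a hypothetical helper, not part of any package. Adding `.` and `-` to the separator class is an assumption about how the pattern might be widened to cover formats such as `10-04-2018` and `04.10.2018`; year-first strings like `2018/04/10` are still not matched.

```r
# Hypothetical helper: return substrings shaped like
# "<alnum token> <sep> <2 digits> <sep> <4 digits>".
extract_date_strings <- function(text, seps = " /") {
  pattern <- sprintf("[[:alnum:]]+[%s]*\\d{2}[%s]*\\d{4}", seps, seps)
  regmatches(text, gregexpr(pattern, text))[[1]]
}

str1 <- "My Friend is coming on july 10 2018 or 10/07/2018"
extract_date_strings(str1)
# "july 10 2018" "10/07/2018"

# Widened separator class (an assumption, not from the answer):
str2 <- "due 10-04-2018 or 04.10.2018"
extract_date_strings(str2, seps = " /.-")
# "10-04-2018" "04.10.2018"
```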

parsedate works well for these things.

library(parsedate)

dates = c("April 10 2018", "10-04-2018", "10/04/2018", "2018/04/10", "04.10.2018")
parsedate::parse_date(dates)

[1] "2018-04-10 UTC" "2018-10-04 UTC" "2018-10-04 UTC" "2018-04-10 UTC" "2018-10-04 UTC"
Lucas

`parsedate` is a nice package, but it fails on the following string, returning a date even though the text contains no complete date:

txt = "Live coverage as American payrolls data shows big rise in unemployment, after composite PMI data shows UK business activity sunk to a record low in March following the Covid-19 lockdown"
parsedate::parse_date(txt)
[1] "2020-03-19 UTC"
sotnik
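
One possible way to guard against this kind of false positive is to check that the text contains something shaped like a full date before handing it to a parser. The sketch below uses base R only; `looks_like_date` and both patterns are hypothetical, cover only numeric dates and English month names, and will still let some non-dates through.

```r
# Hypothetical pre-filter: TRUE when the text contains a date-shaped substring.
looks_like_date <- function(text) {
  # e.g. 10/07/2018, 2018/04/10, 04.10.2018, 10-04-2018
  numeric_date <- "\\b\\d{1,4}[ /.-]\\d{1,2}[ /.-]\\d{2,4}\\b"
  # e.g. "july 10 2018" or "April 10, 2018" (English month names only)
  month_date <- paste0("\\b(", paste(month.name, collapse = "|"),
                       ")\\s+\\d{1,2},?\\s+\\d{4}\\b")
  grepl(numeric_date, text, perl = TRUE) |
    grepl(month_date, text, perl = TRUE, ignore.case = TRUE)
}

txt <- "Live coverage as American payrolls data shows big rise in unemployment, after composite PMI data shows UK business activity sunk to a record low in March following the Covid-19 lockdown"
looks_like_date(txt)                                                   # FALSE
looks_like_date("My Friend is coming on july 10 2018 or 10/07/2018")  # TRUE
```

Texts for which the check returns `FALSE` can then be skipped rather than parsed.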