I'm programming in R and I have a dataset like this:
Date
"mrt 2015"
"2012-06-22"
"2012 in Munchen"
"1998?"
"02-2012"
"02-01-1990"
..
How do I extract the four consecutive digits, i.e. the year, from each value (2015, 2012, 2012, 1998, ..)?
You just need to capture a group of 4 digits anywhere in your string:
sub(".*(\\d{4}).*", "\\1", your_strings)
#[1] "2015" "2012" "2012" "1998" "2012" "1990"
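Explanation: .* means any character, 0 or more times; then you put what you want to capture between parentheses (here 4 digits: \\d{4}); then again any character, 0 or more times (.*). In the replacement, \\1 refers to the captured group, so the whole string is replaced by just the four digits.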
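One caveat worth checking before relying on this: when an element contains no run of four digits, sub finds nothing to replace and returns that element unchanged instead of NA. The following is a minimal sketch of one way to guard against that, using the sample values from the question plus one hypothetical entry ("mr5") that has no year; the name years is only illustrative.

```r
# Sample values from the question, plus one hypothetical entry without a year
your_strings <- c("mrt 2015", "2012-06-22", "2012 in Munchen",
                  "1998?", "02-2012", "02-01-1990", "mr5")

# sub() returns elements without a match unchanged ("mr5" stays "mr5")
years <- sub(".*(\\d{4}).*", "\\1", your_strings)

# Mark elements that contain no four-digit run as missing
years[!grepl("\\d{4}", your_strings)] <- NA

as.integer(years)
#[1] 2015 2012 2012 1998 2012 1990   NA
```

Setting the non-matching elements to NA before converting keeps as.integer from having to coerce leftover text such as "mr5".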
We can use str_extract to get the numbers if they occur at the beginning of the string; otherwise it returns NA:
library(stringr)
as.integer(str_extract(df1$Date, "^\\d{4}"))
#[1] 2015 2012 2012 1998
Based on the OP's edited dataset, where the 4-digit number can occur anywhere in the string, we remove the ^ (which anchors the match to the beginning of the string) and use only the pattern \\d{4}, i.e. a 4-digit number:
as.integer(str_extract(df1$Date, "\\d{4}"))
#[1] 2015 2012 2012 1998 2012 1990
Note that this is very specific, i.e. if an element doesn't contain the pattern, it returns NA:
as.integer(str_extract(c('mrt 2015', 'mr5', '201-01', '02-01-1990', '2012'), '\\d{4}'))
#[1] 2015 NA NA 1990 2012
Or a base R option is regmatches/regexpr:
as.integer(regmatches(df1$Date, regexpr("\\d{4}", df1$Date)))
#[1] 2015 2012 2012 1998 2012 1990
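One difference from str_extract may matter here: regmatches returns only the pieces that matched, so elements without a four-digit run are dropped and the result can be shorter than the input. A minimal sketch of one way to keep the result aligned with the input, reusing the test vector from the str_extract example above; the names x, m, has_year and years are illustrative.

```r
x <- c("mrt 2015", "mr5", "201-01", "02-01-1990", "2012")

# regexpr() gives the position of the first match, or -1 where there is none
m <- regexpr("\\d{4}", x)

# Only matched pieces are returned, so two of the five elements are missing
length(regmatches(x, m))
#[1] 3

# Pre-fill with NA, then write the matches back at the positions that matched
has_year <- grepl("\\d{4}", x)
years <- rep(NA_integer_, length(x))
years[has_year] <- as.integer(regmatches(x, m))
years
#[1] 2015   NA   NA 1990 2012
```

This matters mostly when the result is assigned back to a column of df1, where the lengths have to agree.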