5

I am relatively new to regular expressions and I am running into a dead end. I have a data frame with a column that looks like this:

year1
GMM14_2000_NGVA
GMM14_2001_NGVA
GMM14_2002_NGVA
...
GMM14_2014_NGVA

I am trying to extract the year in the middle of the string (2000, 2001, etc.). This is my code so far:

gsub("[^0-9]", "", year1)

This returns the year, but it also returns the 14 that is part of the string:

142000
142001

Any idea on how to exclude the 14 from the pattern or how to extract the year information more efficiently?

Thanks

Wiktor Stribiżew
asado23

5 Answers

10

Use the following gsub:

s <- "GMM14_2002_NGVA"
gsub("^[^_]*_|_[^_]*$", "", s)


The regex breakdown:

Match...

  • ^[^_]*_ - 0 or more characters other than _ from the start of the string, followed by a _
  • | - or...
  • _[^_]*$ - a _ followed by 0 or more characters other than _ up to the end of the string

and remove them.
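To see the same pattern applied to a whole column rather than a single string, here is a minimal base-R sketch. The vector `year1` is a made-up stand-in for the asker's data frame column, and the conversion to integer is an assumption about what the years will be used for.

```r
# Hypothetical stand-in for the data frame column from the question
year1 <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA", "GMM14_2014_NGVA")

# Remove everything up to the first "_" and everything from the last "_" onward
years <- gsub("^[^_]*_|_[^_]*$", "", year1)
print(years)

# gsub returns character strings; convert if numeric years are needed
years_int <- as.integer(years)
print(years_int)
```

Because gsub is vectorised, no loop is needed; the same call works directly on a data frame column.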

As an alternative,

library(stringr)
str_extract(s,"(?<=_)\\d{4}(?=_)")

Here the Perl-like lookarounds match a 4-digit substring that is enclosed by underscores, without including the underscores in the result.

Wiktor Stribiżew
  • Note that your regex in gsub matches every character that is not a digit and removes it from the input. That is why you had all digits from input left in the result. – Wiktor Stribiżew Oct 01 '15 at 15:17
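The point of this comment can be checked with a short sketch. The second pattern is only one possible way to anchor the year between the underscores and is not taken from the answer above; the variable names are illustrative.

```r
s <- "GMM14_2002_NGVA"

# Deleting every non-digit keeps ALL digits, including the "14" in the prefix
all_digits <- gsub("[^0-9]", "", s)
print(all_digits)   # "142002"

# Capturing four digits between underscores keeps only the year
year_only <- sub("^[^_]*_([0-9]{4})_.*$", "\\1", s)
print(year_only)    # "2002"
```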
7

Using the stringi package, the following is one way. The assumption is that the year always has 4 digits. Since the number of digits is known, this is pretty straightforward.

library(stringi)

x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

stri_extract_last(x, regex = "\\d{4}")
#[1] "2000" "2001"

or

stri_extract_first(x, regex = "\\d{4}")
#[1] "2000" "2001"
jazzurro
  • There is one potential issue with this regex: since it does not account for the context, any last or first 4-digit sequence will be extracted. – Wiktor Stribiżew Oct 01 '15 at 15:21
  • @stribizhev Sure thing. Seeing the patterns in the sample data, I decided to choose this way. If there are some other patterns, this is not the way to go. Thank you for leaving the comment. :) – jazzurro Oct 01 '15 at 15:22
  • You could also use the direct function `stri_extract_last_regex(x, "\\d+")`. Should be faster since it avoids some checks – Rich Scriven Oct 01 '15 at 17:35
  • @RichardScriven Long time. Yes, I agree with you! Thank you very much for leaving this comment. – jazzurro Oct 01 '15 at 23:19
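The context issue raised in the comments can be illustrated with a small sketch. It is written in base R so that it runs without extra packages, and the first input string is hypothetical: it is not in the asker's data and is chosen only because its prefix contains a 4-digit run.

```r
# Hypothetical inputs: the first prefix itself contains four digits in a row
x <- c("GMM2014_2000_NGVA", "GMM14_2001_NGVA")

# Context-free pattern: takes the first 4-digit run it finds
first_any <- regmatches(x, regexpr("\\d{4}", x))
print(first_any)   # "2014" "2001"

# Context-aware pattern: 4 digits that sit between two underscores
between <- regmatches(x, regexpr("(?<=_)\\d{4}(?=_)", x, perl = TRUE))
print(between)     # "2000" "2001"
```

For data shaped exactly like the sample in the question, both patterns give the same result; the difference only appears with inputs like the first one.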
2

Another option in base R would be strsplit, using @jazzurro's data:

x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

vapply(strsplit(x, '_'), function(x) x[2], character(1))
[1] "2000" "2001"

strsplit splits each element of the x vector on the underscores _ and returns a list of the same length as x. Using vapply, we collect the second element of each vector in the list, i.e. the year between the underscores.
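The two steps described here can be made visible by storing the intermediate list. This is only a sketch; the names `parts` and `years` are illustrative, and `fixed = TRUE` is an optional addition that treats the underscore as a literal string rather than a regex.

```r
x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

# Step 1: strsplit returns a list with one character vector per input string
parts <- strsplit(x, "_", fixed = TRUE)
str(parts)

# Step 2: take the second piece of each vector; vapply checks that every
# result is a single character string
years <- vapply(parts, function(p) p[2], character(1))
print(years)
```

If a string had no underscore, `p[2]` would be NA for that element, so this approach relies on every string having the same layout.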

LyzandeR
2

You may use sub.

sub(".*_(\\d{4})_.*", "\\1", x)

or

devtools::install_github("Avinash-Raj/dangas")
library(dangas)
extract_a("_", "_", x)

This extracts all the characters between the start and end delimiters. Here both delimiters are underscores.

syntax:

extract_a(start, end, string)
Avinash Raj
1

I have never used R, but I have a lot of experience with regexps.

The idiomatic way would be to use matching rather than replacement.

For R it should be regmatches:

Use regmatches to get the actual substrings matched by the regular expression. As the first argument, pass the same input that you passed to regexpr or gregexpr. As the second argument, pass the vector returned by regexpr or gregexpr. If you pass the vector from regexpr then regmatches returns a character vector with all the strings that were matched. This vector may be shorter than the input vector if no match was found in some of the elements. If you pass the vector from gregexpr then regmatches returns a list with the same number of elements as the input vector. Each element is a character vector with all the matches of the corresponding element in the input vector, or an empty character vector if an element had no matches.

> x <- c("abc", "def", "cba a", "aa")
> m <- regexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[1]  "a"  "a"  "aa"

In your case it should be:

m <- regexpr("\\d{4}", year1, perl=TRUE)
regmatches(year1, m)

If the same string can contain another run of 4 digits, you can require the surrounding underscores with lookarounds. Unlike non-capturing groups, lookarounds keep the underscores out of the match. Probably like this:

"(?<=_)\\d{4}(?=_)"

Sorry, I have had no chance to test all this in R.
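For readers who want to try the matching approach, here is a minimal sketch of how regmatches behaves with regexpr versus gregexpr. The vector `year1` is a made-up stand-in for the asker's column, the element without a year is added only to show the no-match case, and the NA-filling step at the end is one possible way to keep the result aligned with the input.

```r
# Hypothetical data; the second element deliberately has no year
year1 <- c("GMM14_2000_NGVA", "no_year_here", "GMM14_2014_NGVA")
pat <- "(?<=_)\\d{4}(?=_)"   # four digits between underscores

# regexpr: first match per element; elements without a match are dropped
m <- regexpr(pat, year1, perl = TRUE)
matched <- regmatches(year1, m)
print(matched)   # "2000" "2014"

# gregexpr: a list with one entry per input element
g <- gregexpr(pat, year1, perl = TRUE)
all_matches <- regmatches(year1, g)
print(all_matches)

# Keep alignment with the input by filling NA where nothing matched
years <- rep(NA_character_, length(year1))
years[m != -1] <- matched
print(years)     # "2000" NA "2014"
```

The aligned version matters when the result is assigned back into a data frame column, since a shorter vector would not fit.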

Alexander Trakhimenok