I'm looking to extract the year from a string. The year always comes after an 'X' and before a '.', which is followed by a string of other characters.

Using stringr's str_extract I'm trying the following:

library(stringr)
year <- str_extract(string = 'X2015.XML.Outgoing.pounds..millions.',
                    pattern = 'X(\\d{4})\\.')

I thought the brackets would define the capture group, returning "2015", but I actually get the complete match, "X2015.".

Am I doing this correctly? Why am I not trimming "X" and "."?

Ronak Shah
Preston

3 Answers

The capture group is irrelevant in this case. The function str_extract will return the whole match including characters before and after the capture group.

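To get only the digits from str_extract, use a lookbehind and a lookahead instead. Both are zero-width assertions: they check the surrounding characters without including them in the match.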

library(stringr)
str_extract(string = 'X2015.XML.Outgoing.pounds..millions.',
            pattern = '(?<=X)\\d{4}(?=\\.)')
# [1] "2015"

This regex matches four consecutive digits that are preceded by an X and followed by a period (.).
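
The same lookaround pattern can be applied to a whole vector, for example a set of column names. This is only a sketch: the vector cols below is hypothetical and not part of the question. Elements that do not match come back as NA.

```r
library(stringr)

# Hypothetical column names in the same shape as the question's string
cols <- c("X2015.XML.Outgoing.pounds..millions.",
          "X2016.XML.Incoming.pounds..millions.",
          "Region")

# The lookbehind and lookahead assert the X and the dot without consuming them
years <- str_extract(cols, "(?<=X)\\d{4}(?=\\.)")
years
# [1] "2015" "2016" NA

# Convert to integer if a numeric year is needed; NA stays NA
as.integer(years)
# [1] 2015 2016   NA
```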

Sven Hohenstein

I believe the most idiomatic way is to use str_match:

str_match(string = 'X2015.XML.Outgoing.pounds..millions.',
          pattern = 'X(\\d{4})\\.')

This returns a character matrix with the complete match in the first column and one further column per capture group:

     [,1]     [,2]  
[1,] "X2015." "2015"

As such, the following will do the trick:

str_match(string = 'X2015.XML.Outgoing.pounds..millions.',
          pattern = 'X(\\d{4})\\.')[2]
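
Indexing with [2] works here because a single input string gives a one-row matrix. For a vector of strings, selecting the second column with [, 2] is likely the safer habit. A minimal sketch, assuming a hypothetical vector cols that is not part of the question:

```r
library(stringr)

# Hypothetical inputs; the last one has no year and yields NA
cols <- c("X2015.XML.Outgoing.pounds..millions.",
          "X2016.XML.Incoming.pounds..millions.",
          "Region")

m <- str_match(cols, "X(\\d{4})\\.")

# Column 2 holds the first capture group for every input string
m[, 2]
# [1] "2015" "2016" NA

# Plain [2] indexes the matrix column-wise, so it picks row 2 of column 1
m[2]
# [1] "X2016."
```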
Sam De Meyer

Alternatively, you can use gsub:

string = 'X2015.XML.Outgoing.pounds..millions.'

gsub("X(\\d{4})\\..*", "\\1", string)
# [1] "2015"

or str_replace from stringr:

library(stringr)
str_replace(string, "X(\\d{4})\\..*", "\\1")
# [1] "2015"
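
One caveat with the replacement approach: when the pattern does not match, the input is returned unchanged rather than as NA. The sketch below illustrates this with a hypothetical second string, "Region", and uses sub, which is enough here because at most one replacement is needed per string.

```r
library(stringr)

# The second string is a hypothetical input that contains no year
strings <- c("X2015.XML.Outgoing.pounds..millions.", "Region")

# Replacement leaves non-matching strings as they are
replaced <- sub("X(\\d{4})\\..*", "\\1", strings)
replaced
# [1] "2015"   "Region"

# Extraction with str_match reports a missing match as NA instead
extracted <- str_match(strings, "X(\\d{4})\\.")[, 2]
extracted
# [1] "2015" NA
```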
acylam