
I find the syntax of R's gsub() difficult. Could you please help me extract, for example, "DA VINCI" from "16. DA VINCI_RETOUR"?

I have already tried gsub("_.+$", "", x), but it only removes what comes after the "_", and I would also like to remove what comes before the ". ".

Thank you so much for your help!

3 Answers


Here is one option using a capture group: match a word (\\w+) followed by whitespace and another word as a single group, then replace the whole string with the backreference to that group (\\1).

sub("^\\d+\\.\\s+(\\w+\\s+\\w+)_.*", "\\1", str1)

data

str1 <- "16. DA VINCI_RETOUR" 
akrun
  • Thank you very much for your answer! It works perfectly on "16. DA VINCI_RETOUR", but when I try it on another example such as "7. TILLEUL_RETOUR", your code no longer works. :( – Oct 01 '19 at 15:10
  • @sallyb Please compare the pattern in your original example with the new one. In the first, there is `DA` followed by a space; in the second, there is not. Regex works with patterns. – akrun Oct 01 '19 at 15:11
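The mismatch discussed in these comments can be reproduced directly, since sub() returns its input unchanged when the pattern does not match. Below is a minimal sketch of a more permissive variant. It assumes the wanted text is everything between the leading number and the first underscore; the [^_]+ character class is an assumption introduced here, not part of the answer above.

```r
x <- c("16. DA VINCI_RETOUR", "7. TILLEUL_RETOUR")

# Original pattern: requires two words separated by whitespace, so the
# second element does not match and is returned unchanged.
two_words <- sub("^\\d+\\.\\s+(\\w+\\s+\\w+)_.*", "\\1", x)

# Assumed variant: capture any run of non-underscore characters instead,
# which covers one word, two words, or more.
any_words <- sub("^\\d+\\.\\s+([^_]+)_.*", "\\1", x)

print(two_words)
print(any_words)
```

The variant trades strictness for coverage: it no longer checks that the name consists of word characters, only that it contains no underscore.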

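The leading .* matches everything at the beginning, \\. followed by a space matches the literal ". ", (.*) captures everything up to the last _ and stores it in \\1, and _.* matches the rest. Replacing the whole match with \\1 keeps only the captured part.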

x  <- "16. DA VINCI_RETOUR"
sub(".*\\. (.*)_.*", "\\1", x)
#[1] "DA VINCI"

x  <- "7. TILLEUL_RETOUR"
sub(".*\\. (.*)_.*", "\\1", x)
#[1] "TILLEUL"
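Because (.*) is greedy, the capture runs up to the last underscore, not the first. A small sketch of that behaviour follows, using a made-up input with two underscores (not from the question) and, as an assumed alternative, a [^_]* capture that stops at the first underscore:

```r
# Hypothetical input with two underscores, for illustration only.
y <- "16. DA VINCI_ALLER_RETOUR"

# Greedy capture: extends to the last "_".
greedy <- sub(".*\\. (.*)_.*", "\\1", y)

# Assumed alternative: [^_]* cannot cross an underscore,
# so the capture stops at the first "_".
first_only <- sub(".*\\. ([^_]*)_.*", "\\1", y)

print(greedy)
print(first_only)
```

For inputs with a single underscore, such as the two examples above, both patterns should give the same result.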
GKi

An alternative that uses strsplit:

gsub("\\d+\\.\\s", "",
     strsplit(the_string, "_")[[1]][1])
#[1] "DA VINCI"

Data:

the_string <- "16. DA VINCI_RETOUR"
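The [[1]][1] indexing above handles a single string. For a character vector, one possible adaptation is sketched below. It assumes the same "number, dot, space, name, underscore" layout, and the variable names the_strings, before_underscore and result are illustrative only.

```r
the_strings <- c("16. DA VINCI_RETOUR", "7. TILLEUL_RETOUR")

# strsplit() returns a list with one character vector per input string;
# keep the first piece (the text before the first "_") of each.
before_underscore <- vapply(strsplit(the_strings, "_", fixed = TRUE),
                            function(parts) parts[1], character(1))

# Drop the leading "<digits>. " prefix from each piece.
result <- sub("^\\d+\\.\\s", "", before_underscore)
print(result)
```

Here fixed = TRUE treats "_" as a literal string rather than a regular expression, and the ^ anchor restricts the removal to the start of each piece.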
NelsonGon