
I have some data in this format:

#> # A tibble: 2 × 2
#>   record id                                                           
#>    <int> <chr>                                                        
#> 1      1 "<a href=\"https://www.example.com/dir1/dir2/8379\">8379</a>"
#> 2      2 "<a href=\"https://www.example.com/dir1/dir2/8179\">8179</a>"

I would like to use stringr to keep only the part of each string between ">" and "<".

So my desired output would be:

#> # A tibble: 2 × 2
#>   record id                                                           
#>    <int> <chr>                                                        
#> 1      1 "8379"
#> 2      2 "8179"

I have tried using str_match:

str_match(df$id, pattern = ">(....)<") 

and the second column is what I'm after:

#>      [,1]     [,2]  
#> [1,] ">8379<" "8379"
#> [2,] ">8179<" "8179"

How do I now use it in, say, a mutate command to change a column in the data frame?

Tidyverse solutions preferred, but open to all answers.

Code for data entry below.

library(tidyverse)
df <-  tibble::tribble(
  ~record,                                                           ~id,
       1L, "<a href=\"https://www.example.com/dir1/dir2/8379\">8379</a>",
       2L, "<a href=\"https://www.example.com/dir1/dir2/8179\">8179</a>"
  )
df

str_match(df$id, pattern = ">(....)<") 
Allan Cameron
Jeremy K.

1 Answer


You can use str_extract() with a regex. A lookbehind matches the character(s) immediately before the text you want, and a lookahead matches the character(s) immediately after it; neither is included in the extracted result. The code:

df %>%
  mutate(id = str_extract(id, "(?<=\\>)(.*)(?=\\<)"))

#   record id
#    <int> <chr>
# 1      1 8379
# 2      2 8179
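
If you would rather keep the str_match() approach from the question, a minimal sketch is below. It relies on str_match() returning a character matrix whose second column holds the first capture group, so indexing with [, 2] gives a plain vector that mutate() can assign. The name result is only illustrative, and the pattern "[^<]+" is my substitution for "....": it assumes the link text never contains "<", and it avoids tying the match to four-character ids.

```r
library(dplyr)
library(stringr)

# Same example data as in the question
df <- tibble::tribble(
  ~record, ~id,
  1L, "<a href=\"https://www.example.com/dir1/dir2/8379\">8379</a>",
  2L, "<a href=\"https://www.example.com/dir1/dir2/8179\">8179</a>"
)

# str_match() returns a matrix: column 1 is the full match (">8379<"),
# column 2 is the first capture group ("8379"), so keep only column 2.
# "[^<]+" matches one or more characters that are not "<"
# (assumption: the link text itself never contains "<").
result <- df %>%
  mutate(id = str_match(id, ">([^<]+)<")[, 2])

result
```

Rows where the pattern does not match should come back as NA, which is worth checking for if the real data is less regular than the example.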
Mark