Efficient extraction of number in middle of semi-irregular text string

Question

I would like to extract a one-digit number from the middle of a text string that has small variations. The desired digit is preceded by either 4 or 5 characters, and it is followed sometimes by '[letter].docx' and other times by just '.docx'.

I've written a brute force solution, but I'd like to learn how to do it more elegantly, with two specific questions.

Two questions:

1. How could one write the regex patterns below more generally? Brute force works in my case because I only have ten variations, but I'd love to see a general solution.
2. Why doesn't the array() option work? I'm trying to implement what I understand to be described here. For some reason, in my case R returns an error after the third element of the replacement array.

Data:

data$file
XX12_1a.docx
XX4_1b.docx
XX35_4.docx
XX9_3.docx
XX21_2.docx

Goal:

data$id
1
1
4
3
2

SSCCE:

library(tidyverse)  # attaches stringr, which provides str_replace()

data <- data.frame(file = c('XX12_1a.docx',
               'XX4_1b.docx',
               'XX35_4.docx',
               'XX9_3.docx',
               'XX21_2.docx'))

# Brute force solution:
data$id <- str_replace(data$file, '.....1a.....', '1')
data$id <- str_replace(data$id, '.....1b.....', '1')
data$id <- str_replace(data$id, '.....2.....', '2')
data$id <- str_replace(data$id, '.....3.....', '3')
data$id <- str_replace(data$id, '.....4.....', '4')
data$id <- str_replace(data$id, '....1a.....', '1')
data$id <- str_replace(data$id, '....1b.....', '1')
data$id <- str_replace(data$id, '....2.....', '2')
data$id <- str_replace(data$id, '....3.....', '3')
data$id <- str_replace(data$id, '....4.....', '4')

# More concise attempt, does not run
data$id2 <- str_replace(data$file, 
            array('.....1a.....', 
                  '.....1b.....', 
                  '.....2.....', 
                  '.....3.....',
                  '.....4.....',
                  '....1a.....',
                  '....1b.....',
                  '....2.....',
                  '....3.....',
                  '....4.....'), 
            array('1', '1', '2', '3', '4', '1', '1', '2', '3', '4'))
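
On the second question, a likely cause (an assumption, since the error message itself is not shown): base R's array() has the signature array(data, dim, dimnames), so only the first argument is treated as data and a fourth positional argument is rejected as unused. c() is the function that combines values into a vector. The sketch below uses only base R and applies the original ten patterns in order with sub(); the names files, patterns, replacements, and id are illustrative.

```r
files <- c('XX12_1a.docx', 'XX4_1b.docx', 'XX35_4.docx', 'XX9_3.docx', 'XX21_2.docx')

# array(data, dim, dimnames) takes at most three arguments, so passing each
# pattern as its own argument fails. Capture the error instead of stopping.
err <- tryCatch(
  array('.....1a.....', '.....1b.....', '.....2.....', '.....3.....'),
  error = function(e) e
)
inherits(err, 'error')   # TRUE

# c() builds the intended character vectors.
patterns <- c('.....1a.....', '.....1b.....', '.....2.....', '.....3.....', '.....4.....',
              '....1a.....', '....1b.....', '....2.....', '....3.....', '....4.....')
replacements <- c('1', '1', '2', '3', '4', '1', '1', '2', '3', '4')

# Apply each pattern in turn, exactly as the brute-force version does.
id <- files
for (i in seq_along(patterns)) {
  id <- sub(patterns[i], replacements[i], id)
}
id   # "1" "1" "4" "3" "2"
```

Because these patterns are not anchored, their order matters: a shorter pattern such as '....2.....' can also match inside a longer file name, so the longer patterns have to run first.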

score 2 · Answer 1 · answered Aug 12 '21 at 16:48

You could just use sub here:

data <- data.frame(file=c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", "XX9_3.docx", "XX21_2.docx"))
data$id <- sub("^.*_(\\d+).*$", "\\1", data$file)
data

          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2
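
To make the pieces of that pattern explicit: ^.*_ greedily consumes everything up to the last underscore that is followed by a digit, (\\d+) captures the run of digits, and .*$ consumes the rest, so the whole string is replaced by the captured group. A minimal sketch of two edge cases, using made-up file names that are not part of the question's data:

```r
pattern <- "^.*_(\\d+).*$"

# Hypothetical name with two underscores: the greedy prefix runs to the last one.
two_underscores <- sub(pattern, "\\1", "XX_12_7c.docx")
two_underscores   # "7"

# Hypothetical name without an underscore: sub() returns the input unchanged,
# not NA, when the pattern does not match.
no_match <- sub(pattern, "\\1", "notes.docx")
no_match          # "notes.docx"

# The result is character; convert it if numeric ids are needed.
ids <- as.integer(sub(pattern, "\\1", c("XX12_1a.docx", "XX35_4.docx")))
ids               # 1 4
```

If unmatched names are possible in the real data, it may be worth checking for them before converting, since as.integer() would turn them into NA with a warning.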

score 2 · Answer 2 · answered Aug 12 '21 at 16:52

You could use extract:

library(tidyverse)
data <- data %>%
   extract(file, 'id', '_(\\d+)', remove = FALSE)
          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2
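
extract() fills the new column from the regex's capture group, so '_(\\d+)' keeps the digits and leaves out the underscore. The same idea can be sketched in base R with regexec(), which reports capture groups. The last two file names below are hypothetical additions, included only to show a multi-digit id and a non-matching name.

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx",
           "XX7_12.docx",   # hypothetical: two-digit id
           "notes.docx")    # hypothetical: no underscore

# regexec() records the full match and each capture group;
# regmatches() gives character(0) for elements that did not match.
groups <- regmatches(files, regexec("_([0-9]+)", files))

# Keep the first capture group, or NA when there was no match.
id <- vapply(groups,
             function(g) if (length(g) >= 2) g[2] else NA_character_,
             character(1))
id   # "1" "1" "4" "12" NA
```

The ids come back as character here. With tidyr's extract(), passing convert = TRUE should give a numeric column instead.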

Onyambu

score 2 · Answer 3 · answered Aug 12 '21 at 17:06

An option with trimws from base R

data$id <- trimws(data$file, whitespace = ".*_|\\D?\\..*")

-output

> data
          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2

data

data <- structure(list(file = c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", 
"XX9_3.docx", "XX21_2.docx")), class = "data.frame", row.names = c(NA, 
-5L))
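
The whitespace argument of trimws() is treated as a regular expression, so this call strips everything matching either alternative from the ends of each string. A roughly equivalent sketch with gsub() may make the two alternatives easier to read; the variable names are illustrative.

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", "XX9_3.docx", "XX21_2.docx")

# Delete every match of either alternative:
#   .*_         everything up to and including the underscore
#   \\D?\\..*   an optional non-digit, then the dot and the rest of the name
id_gsub <- gsub(".*_|\\D?\\..*", "", files, perl = TRUE)
id_gsub   # "1" "1" "4" "3" "2"

# The trimws() form; its whitespace argument needs R >= 3.6.0.
id_trim <- trimws(files, whitespace = ".*_|\\D?\\..*")
identical(id_trim, id_gsub)
```

Both forms return character values and rely on the id being the only text between the underscore and the optional letter.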

score 2 · Accepted Answer · answered Aug 12 '21 at 17:11

Since the target digit seems, from your examples, to always be preceded by _, you can use a lookbehind:

library(stringr)
str_extract(data$file, "(?<=_)\\d")
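
A lookbehind is zero-width: (?<=_) only asserts that an underscore sits just before the digit, and the underscore is not part of what gets returned. A small base R sketch of the difference follows. Note that base R needs perl = TRUE for lookbehind, and the variable names are illustrative.

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", "XX9_3.docx", "XX21_2.docx")

# Without the lookbehind, the underscore is part of the match.
with_underscore <- regmatches(files, regexpr("_\\d", files, perl = TRUE))
with_underscore   # "_1" "_1" "_4" "_3" "_2"

# With the lookbehind, only the digit is returned.
digit_only <- regmatches(files, regexpr("(?<=_)\\d", files, perl = TRUE))
digit_only        # "1" "1" "4" "3" "2"
```

Two caveats: \\d matches exactly one digit, so a two-digit id would be cut short unless the pattern is changed to \\d+. Also, regmatches() with regexpr() silently drops non-matching elements, whereas str_extract() returns NA for them and so keeps the result aligned with the input.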

Chris Ruehlemann

score 1 · Answer 5 · answered Aug 12 '21 at 17:30

Here is a tidyverse solution:

library(tidyverse)
data %>% 
  separate(file, c("split1", "split2"), remove=FALSE) %>% 
  mutate(id = parse_number(split2), .keep="unused") %>% 
  select(-split1)

output:

          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2
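
For readers who want to see what the pipeline is doing: with no sep given, separate() splits on runs of non-alphanumeric characters, so each name yields three pieces such as "XX12", "1a", "docx". Only two column names are supplied, so tidyr will probably warn that the extra piece was discarded unless extra = "drop" is passed. parse_number() then pulls the number out of the second piece. A rough base R approximation, which is cruder than parse_number() because it simply deletes every non-digit:

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", "XX9_3.docx", "XX21_2.docx")

# Split each name on runs of non-alphanumeric characters.
pieces <- strsplit(files, "[^[:alnum:]]+")

# Take the second piece of each name ("1a", "1b", "4", "3", "2").
second <- vapply(pieces, `[`, character(1), 2)

# Strip non-digits and convert to numeric.
id <- as.numeric(gsub("[^0-9]", "", second))
id   # 1 1 4 3 2
```

Unlike the regex answers, this route gives a numeric id directly, at the cost of assuming the id is always in the second piece.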
