I would like to extract a one-digit number from the middle of a text string that has small variations. The desired digit is preceded by either 4 or 5 characters, and it is followed either by '[letter].docx' or by just '.docx'.
I've written a brute-force solution, but I'd like to learn how to do it more elegantly.
Two questions:
- How could one write the regex patterns below more generally? Brute force works in my case because I only have ten variations, but I'd love to see a general solution.
- Why doesn't the array() version work? I'm trying to implement what I understand to be described here. In my case, R returns an error after the third element of the replacement array.
Data:
data$file
XX12_1a.docx
XX4_1b.docx
XX35_4.docx
XX9_3.docx
XX21_2.docx
Goal:
data$id
1
1
4
3
2
SSCCE:
library(tidyverse)
data <- data.frame(file = c('XX12_1a.docx',
                            'XX4_1b.docx',
                            'XX35_4.docx',
                            'XX9_3.docx',
                            'XX21_2.docx'))
# Brute force solution:
data$id <- str_replace(data$file, '.....1a.....', '1')
data$id <- str_replace(data$id, '.....1b.....', '1')
data$id <- str_replace(data$id, '.....2.....', '2')
data$id <- str_replace(data$id, '.....3.....', '3')
data$id <- str_replace(data$id, '.....4.....', '4')
data$id <- str_replace(data$id, '....1a.....', '1')
data$id <- str_replace(data$id, '....1b.....', '1')
data$id <- str_replace(data$id, '....2.....', '2')
data$id <- str_replace(data$id, '....3.....', '3')
data$id <- str_replace(data$id, '....4.....', '4')
# More concise attempt, does not run
data$id2 <- str_replace(data$file,
                        array('.....1a.....',
                              '.....1b.....',
                              '.....2.....',
                              '.....3.....',
                              '.....4.....',
                              '....1a.....',
                              '....1b.....',
                              '....2.....',
                              '....3.....',
                              '....4.....'),
                        array('1', '1', '2', '3', '4', '1', '1', '2', '3', '4'))
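For reference, here is a minimal sketch of one possible general approach. It assumes every file name follows the shape shown in the data: some prefix, an underscore, one digit, an optional lowercase letter, then '.docx'. It uses base R's sub() rather than stringr so that it runs without extra packages. The names files, pattern, id and array_fails are only for illustration. The last lines probe the array() call on its own: as far as I can tell from its signature, array() takes (data, dim, dimnames), so a fourth positional argument is rejected, which would fit an error appearing after the third element.

```r
# Same file names as in the question, as a plain character vector.
files <- c('XX12_1a.docx', 'XX4_1b.docx', 'XX35_4.docx',
           'XX9_3.docx', 'XX21_2.docx')

# Assumed shape: anything, an underscore, ONE digit (captured),
# an optional lowercase letter, then a literal '.docx' at the end.
pattern <- '^.*_([0-9])[a-z]?\\.docx$'

# sub() replaces the whole match with the captured digit (\\1).
id <- as.integer(sub(pattern, '\\1', files))
print(id)

# array() accepts only data, dim and dimnames, so passing four or more
# strings positionally raises an error; c() is the usual way to build
# a character vector of patterns or replacements.
array_fails <- inherits(try(array('1', '1', '2', '3'), silent = TRUE),
                        'try-error')
print(array_fails)
```

If staying within stringr is preferred, the same pattern could presumably be used with str_match() and its first capture group, though I have not verified that against this data.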