I would like to extract a one-digit number from the middle of a text string that has small variations. The desired digit is preceded by either 4 or 5 characters, and it is followed sometimes by '[letter].docx' and other times by just '.docx'.

I've written a brute-force solution, but I'd like to learn how to do it more elegantly.

Two questions:

  1. How could one write the regex below more generally? I can brute-force my case because I only have ten variations, but I'd love to see a general solution.
  2. Why doesn't the array() option work? I'm trying to implement what I understand to be described here. In my case, R returns an error after the third element of the replacement array.

Data:

data$file
XX12_1a.docx
XX4_1b.docx
XX35_4.docx
XX9_3.docx
XX21_2.docx

Goal:

data$id
1
1
4
3
2

SSCCE:

require('tidyverse')

data <- data.frame(file = c('XX12_1a.docx',
               'XX4_1b.docx',
               'XX35_4.docx',
               'XX9_3.docx',
               'XX21_2.docx'))

# Brute force solution:
data$id <- str_replace(data$file, '.....1a.....', '1')
data$id <- str_replace(data$id, '.....1b.....', '1')
data$id <- str_replace(data$id, '.....2.....', '2')
data$id <- str_replace(data$id, '.....3.....', '3')
data$id <- str_replace(data$id, '.....4.....', '4')
data$id <- str_replace(data$id, '....1a.....', '1')
data$id <- str_replace(data$id, '....1b.....', '1')
data$id <- str_replace(data$id, '....2.....', '2')
data$id <- str_replace(data$id, '....3.....', '3')
data$id <- str_replace(data$id, '....4.....', '4')

# More concise attempt, does not run
data$id2 <- str_replace(data$file, 
            array('.....1a.....', 
                  '.....1b.....', 
                  '.....2.....', 
                  '.....3.....',
                  '.....4.....',
                  '....1a.....',
                  '....1b.....',
                  '....2.....',
                  '....3.....',
                  '....4.....'), 
            array('1', '1', '2', '3', '4', '1', '1', '2', '3', '4'))
Dr. Beeblebrox

5 Answers

You could just use sub here:

data <- data.frame(file=c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", "XX9_3.docx", "XX21_2.docx"))
data$id <- sub("^.*_(\\d+).*$", "\\1", data$file)
data

          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2
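
As a rough sketch of why this pattern generalises: `^.*_` greedily consumes everything up to the last underscore, `(\\d+)` captures the digits that follow, and `.*$` swallows the rest, so the length of the prefix no longer matters. The variable names and the longer file name below are illustrative assumptions, not part of the question's data.

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx",
           "XX9_3.docx", "XX21_2.docx")

# Replace the whole string with the captured digits after the underscore.
ids <- sub("^.*_(\\d+).*$", "\\1", files)
print(ids)

# sub() returns character; convert if a numeric id is wanted.
ids_int <- as.integer(ids)

# A hypothetical file name with a longer prefix still works.
longer <- sub("^.*_(\\d+).*$", "\\1", "ABCDEFG123_7z.docx")
print(longer)
```

Note that sub() returns the input unchanged when the pattern does not match, so it may be worth checking for leftover file names afterwards.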
Tim Biegeleisen

You could use extract:

library(tidyverse)
data <- data %>%
   extract(file, 'id', '_(\\d+)', remove = FALSE)
data

          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2
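
For readers without tidyr, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is only an approximate counterpart of the call above, and the names `files`, `m`, and `ids` are illustrative.

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx",
           "XX9_3.docx", "XX21_2.docx")

# regexec() records the full match and each capture group;
# regmatches() then returns one character vector per input string.
m <- regmatches(files, regexec("_(\\d+)", files))

# Element 1 is the full match ("_1"), element 2 is the first group ("1").
ids <- vapply(m, function(x) x[2], character(1))
print(ids)
```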
Onyambu

An option with trimws from base R:

data$id <- trimws(data$file, whitespace = ".*_|\\D?\\..*")

-output

> data
          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2

data

data <- structure(list(file = c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx", 
"XX9_3.docx", "XX21_2.docx")), class = "data.frame", row.names = c(NA, 
-5L))
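
To see what the trimws() call above is doing, it can help to split it into two passes: one strips everything up to the underscore from the left, the other strips an optional non-digit plus the extension from the right. This is a sketch with illustrative variable names, and it assumes R 3.6.0 or later, where trimws() has the `whitespace` argument.

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx",
           "XX9_3.docx", "XX21_2.docx")

# Left pass: drop the prefix up to and including the underscore.
left <- trimws(files, which = "left", whitespace = ".*_")
print(left)

# Right pass: drop an optional non-digit, the dot, and the extension.
ids <- trimws(left, which = "right", whitespace = "\\D?\\..*")
print(ids)
```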
akrun

Since the target digit is, as it seems from your examples, always preceded by _, you can use a lookbehind:

library(stringr)
str_extract(data$file, "(?<=_)\\d")
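
The result is a character vector, so wrap it in as.integer() if a numeric id is needed. If stringr is not available, the same lookbehind can be sketched in base R with perl = TRUE. The third file name below is a hypothetical non-matching case, added to show that regmatches() silently drops non-matches unless they are handled explicitly.

```r
files <- c("XX12_1a.docx", "XX4_1b.docx", "notes.docx")

# regexpr() returns -1 for strings where the lookbehind pattern fails.
hits <- regexpr("(?<=_)\\d", files, perl = TRUE)

# Pre-fill with NA so the result stays aligned with the input.
ids <- rep(NA_character_, length(files))
ids[hits > 0] <- regmatches(files, hits)
print(ids)
```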
Chris Ruehlemann

Here is a tidyverse solution:

library(tidyverse)
data %>% 
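  # The default separator splits on "_" and ".", giving three pieces;
  # extra = "drop" discards the trailing "docx" without a warning.
  separate(file, c("split1", "split2"), remove=FALSE, extra="drop") %>% 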
  mutate(id = parse_number(split2), .keep="unused") %>% 
  select(-split1)

output:

          file id
1 XX12_1a.docx  1
2  XX4_1b.docx  1
3  XX35_4.docx  4
4   XX9_3.docx  3
5  XX21_2.docx  2
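
On the question's second point, one likely explanation, assuming base R's documented signature array(data, dim, dimnames): array() takes the patterns as separate positional arguments, so only three can be matched and the fourth is rejected. c() is the usual way to build a vector. Separately, str_replace() is vectorised pairwise over string and pattern rather than applying each pattern in turn; stringr documents passing a named vector to str_replace_all() for multiple replacements. The base R sketch below uses hypothetical rules and a plain loop to show the sequential idea.

```r
# A fourth positional argument cannot be matched to any parameter.
msg <- tryCatch(
  array("p1", "p2", "p3", "p4"),
  error = function(e) conditionMessage(e)
)
print(msg)

files <- c("XX12_1a.docx", "XX4_1b.docx", "XX35_4.docx")

# Hypothetical rules: names are patterns, values are replacements.
rules <- c("^.*_1[a-z]?\\.docx$" = "1", "^.*_4\\.docx$" = "4")

# Apply each rule in turn to the running result.
ids <- files
for (p in names(rules)) {
  ids <- sub(p, rules[[p]], ids)
}
print(ids)
```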
TarJae