I have a table in R with the following information. Some rows in the employee column end in Roman numerals, some do not:

employee <- c('JOHN SMITH II','PETER RABBIT','POPE GREGORY XIII', 'MARY SUE IV')
salary <- c(21000, 23400, 26800, 100000)
employee_df <- data.frame(employee, salary)
> employee_df
           employee salary
1     JOHN SMITH II  21000
2      PETER RABBIT  23400
3 POPE GREGORY XIII  26800
4       MARY SUE IV 100000

How would I remove the Roman numerals so that employee_df$employee would be the following?

JOHN SMITH    PETER RABBIT    POPE GREGORY   MARY SUE
bltSandwich21
  • You could use `as.roman()` to ensure that you are hitting real Roman numerals. This would be safer than a regex. Also see https://stackoverflow.com/questions/21116763/convert-roman-numerals-to-numbers-in-r – Michael Sebald Oct 29 '20 at 16:49
  • `XI` for instance is a valid name, what do you do in that circumstance? – thelatemail Oct 29 '20 at 22:40
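
The `as.roman()` suggestion in the comments could look roughly like the sketch below. The helper name `strip_roman_suffix` is hypothetical (it does not come from the thread), and the sketch assumes that only the last whitespace-separated token should be tested and that a one-word string is never a suffix.

```r
# Hypothetical helper (not from the thread): drop the last token of a name
# only when utils::as.roman() can parse it as a Roman numeral.
strip_roman_suffix <- function(x) {
  x <- as.character(x)
  # Last whitespace-separated token of each string
  last_token <- sub("^.*\\s+", "", x)
  # as.roman() yields NA (with a warning) for tokens it cannot parse
  parsed <- suppressWarnings(as.integer(utils::as.roman(last_token)))
  # Require a preceding word, so a one-word name such as "XI" is left alone
  is_suffix <- grepl("\\s", x) & !is.na(parsed)
  x[is_suffix] <- sub("\\s+\\S+$", "", x[is_suffix])
  x
}

employee <- c('JOHN SMITH II', 'PETER RABBIT', 'POPE GREGORY XIII', 'MARY SUE IV', 'XI')
cleaned <- strip_roman_suffix(employee)
print(cleaned)
```

This does not settle the second comment: a surname that happens to be a valid numeral in last position (for example LI, which `as.roman()` reads as 51) would still be stripped, so some name-specific rule would be needed for such cases.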

3 Answers


Try this:

# Remove leading digits, or a trailing run of Roman-numeral letters (optionally followed by a period)
employee_df$employee <- gsub('^([0-9]+)|([IVXLCM]+)\\.?$', '', employee_df$employee)

Output:

       employee salary
1   JOHN SMITH   21000
2  PETER RABBIT  23400
3 POPE GREGORY   26800
4     MARY SUE  100000

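Or cleaner, using trimws() to drop the trailing space that the first version leaves behind:

# Same pattern, then trim the leftover whitespace
employee_df$employee <- trimws(gsub('^([0-9]+)|([IVXLCM]+)\\.?$', '', employee_df$employee))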

Output:

      employee salary
1   JOHN SMITH  21000
2 PETER RABBIT  23400
3 POPE GREGORY  26800
4     MARY SUE 100000

The numeric alternative `^([0-9]+)` in the regex is not necessary for this data (many thanks @BenBolker). You can use:

#Code3
employee_df$employee <- trimws(gsub('([IVXLCM]+)\\.?$','',employee_df$employee))

And obtain the same result.
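
One caveat worth checking on your own data: the patterns above do not require whitespace before the numeral, so a surname that merely ends in the letters I, V, X, L, C or M can be clipped. The sketch below compares the pattern from this answer with a variant that demands whitespace first; the extra name 'MARY SIMM' and the variable names `loose` and `strict` are made up for illustration.

```r
# 'MARY SIMM' is an invented test case, not part of the original data
employee <- c('JOHN SMITH II', 'PETER RABBIT', 'POPE GREGORY XIII',
              'MARY SUE IV', 'MARY SIMM')

# Pattern from the answer: the tail "IMM" of "SIMM" is treated as a numeral
loose <- trimws(gsub('([IVXLCM]+)\\.?$', '', employee))

# Variant: the numeral must be preceded by whitespace, which is removed with it
strict <- sub('\\s+[IVXLCM]+\\.?$', '', employee)

print(loose)   # last element becomes "MARY S"
print(strict)  # last element stays "MARY SIMM"
```

Since the whitespace is part of the match, the variant also makes the trimws() call unnecessary. It still cannot distinguish a numeral from a surname that is itself spelled like one.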

Duck

If you mean what you say, then you do not want to remove the Roman numerals in the string with POPE. If that's correct, then one way to remove all other numerals is this:

sub("^(?!\\bPOPE\\b)(.*?)\\s[IVXLCM]+$", "\\1", employee_df$employee, perl = TRUE)
[1] "JOHN SMITH"        "PETER RABBIT"      "POPE GREGORY XIII" "MARY SUE"

Here we are using a negative lookahead (?!...) placed right after ^, asserting that the string must not start with the word POPE, and the backreference \\1 to 'recollect' whatever is matched prior to the sequence of Roman numerals at the end of the string.
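
Because the lookahead sits directly after `^`, it only tests the beginning of the string, and the `\\b` on either side makes it match POPE as a whole word. A small check with invented names (none of them from the question) illustrates both points:

```r
# Invented test strings to probe the lookahead; not part of the original data
x <- c("POPE GREGORY XIII", "JOHN POPE II", "POPEYE II")

out <- sub("^(?!\\bPOPE\\b)(.*?)\\s[IVXLCM]+$", "\\1", x, perl = TRUE)
print(out)
# "POPE GREGORY XIII" is untouched: the string starts with the word POPE
# "JOHN POPE II" becomes "JOHN POPE": POPE is not at the start
# "POPEYE II" becomes "POPEYE": POPE is not a whole word there
```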

Chris Ruehlemann

An option with str_remove() from stringr:

library(dplyr)
library(stringr)
employee_df %>%
        # "$" anchors the match to the end, so only a trailing numeral is removed
        mutate(employee = str_remove(employee, "\\s+[IVXLCM]+$"))
akrun