R remove roman numerals from column

Question

I have a table in R with the following information. Some rows in employee have roman numerals, some do not:

employee <- c('JOHN SMITH II','PETER RABBIT','POPE GREGORY XIII', 'MARY SUE IV')
salary <- c(21000, 23400, 26800, 100000)
employee_df <- data.frame(employee, salary)
> employee_df
           employee salary
1     JOHN SMITH II  21000
2      PETER RABBIT  23400
3 POPE GREGORY XIII  26800
4       MARY SUE IV 100000

How would I remove the Roman numerals so that employee_df$employee would be the following?

JOHN SMITH    PETER RABBIT    POPE GREGORY   MARY SUE

You could use `as.roman()` to ensure that you are hitting real roman numerals. This would be safer than a regex. Also see https://stackoverflow.com/questions/21116763/convert-roman-numerals-to-numbers-in-r — Michael Sebald, Oct 29 '20 at 16:49
`XI` for instance is a valid name, what do you do in that circumstance? — thelatemail, Oct 29 '20 at 22:40
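The `as.roman()` suggestion above could be sketched roughly as follows. This is only an illustration: `strip_roman_suffix` is a hypothetical helper name, and the rule "only strip the last token, and only when the name has more than one token" is an assumption, not something stated in the question. It does not settle the ambiguity raised about names such as `XI`; it only avoids stripping a name that consists of a single token.

```r
# Hypothetical helper: drop the last whitespace-separated token of each
# string, but only if utils::as.roman() accepts it as a Roman numeral.
strip_roman_suffix <- function(x) {
  x <- as.character(x)
  has_space <- grepl("\\s", x)
  # Last token of each string (everything after the final whitespace)
  last <- sub("^.*\\s", "", x)
  # Candidates: multi-token strings whose last token uses only Roman letters
  candidate <- has_space & grepl("^[IVXLCDM]+$", last)
  is_roman <- candidate
  if (any(candidate)) {
    # as.roman() returns NA (with a warning) for invalid numerals
    parsed <- suppressWarnings(utils::as.roman(last[candidate]))
    is_roman[candidate] <- !is.na(parsed)
  }
  # Remove the trailing token and the whitespace before it
  x[is_roman] <- sub("\\s+\\S+$", "", x[is_roman])
  x
}

employee <- c('JOHN SMITH II', 'PETER RABBIT', 'POPE GREGORY XIII', 'MARY SUE IV')
cleaned <- strip_roman_suffix(employee)
print(cleaned)
```

A surname that happens to be a valid numeral (for example `MIX`) would still be removed by this sketch, so real data may need an explicit list of exceptions.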

Duck · Accepted Answer · 2020-10-29T16:53:04.577

2

Try this:

#Code
employee_df$employee <- gsub('^([0-9]+)|([IVXLCM]+)\\.?$','',employee_df$employee)

Output:

       employee salary
1   JOHN SMITH   21000
2  PETER RABBIT  23400
3 POPE GREGORY   26800
4     MARY SUE  100000

Or cleaner:

#Code2
employee_df$employee <- trimws(gsub('^([0-9]+)|([IVXLCM]+)\\.?$','',employee_df$employee))

Output:

      employee salary
1   JOHN SMITH  21000
2 PETER RABBIT  23400
3 POPE GREGORY  26800
4     MARY SUE 100000

The numeric component of the regex is not necessary (many thanks @BenBolker). You can use:

#Code3
employee_df$employee <- trimws(gsub('([IVXLCM]+)\\.?$','',employee_df$employee))

And obtain the same result.
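One caveat with the patterns above is that `[IVXLCM]+$` matches any trailing run of those letters, including the end of an ordinary name, and the class omits `D` (500). A stricter variant, offered here as a sketch rather than as the answer's own method, requires whitespace before the numeral. The sample names below are made up for illustration.

```r
# Made-up sample names; only the first and third end in a real numeral suffix
x <- c("JOHN SMITH II", "MALCOLM", "MARY SUE IV.", "LOUIS DCC")

# Pattern from the answer: also eats the "LM" at the end of "MALCOLM"
loose <- gsub("([IVXLCM]+)\\.?$", "", x)

# Stricter sketch: the numeral must follow whitespace, and D is allowed
strict <- gsub("\\s+[IVXLCDM]+\\.?$", "", x)

print(loose[2])  # "MALCO"
print(strict)    # "JOHN SMITH" "MALCOLM" "MARY SUE" "LOUIS"
```

Because the whitespace is part of the match, `trimws()` is no longer needed to remove the trailing space.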

edited Oct 29 '20 at 16:53

answered Oct 29 '20 at 16:44

Duck

what's the `^([0-9]+)` part of the regex for ... ? – Ben Bolker Oct 29 '20 at 16:47
@BenBolker It is a way to take care of numbers if they are present! – Duck Oct 29 '20 at 16:49
@BenBolker But it is not necessary. I will update the post! – Duck Oct 29 '20 at 16:51

score 0 · Answer 2 · answered Oct 29 '20 at 18:24

If you mean what you say then you do not want to remove the roman numerals in the string with POPE. If that's correct then a way to remove all other numerals is this:

sub("^(?!\\bPOPE\\b)(.*?)\\s[IVXLCM]+$", "\\1", employee_df$employee, perl = TRUE)
[1] "JOHN SMITH"        "PETER RABBIT"      "POPE GREGORY XIII" "MARY SUE"

Here we are using a negative lookahead, (?!...), anchored at the start of the string to assert that the string must not begin with the word POPE, and the backreference \\1 to 'recollect' whatever is matched before the sequence of Roman numerals at the end of the string.
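Since the lookahead sits directly after `^`, it only inspects the start of the string. The sketch below contrasts that with a variant that scans the whole string by putting `.*` inside the lookahead; the test strings and the second pattern are illustrative additions, not part of the answer.

```r
# Illustrative strings: POPE at the start, inside another word, and mid-string
x <- c("POPE GREGORY XIII", "ANTIPOPE JOHN XXIII", "JOHN POPE II")

# Lookahead at the start only: blocks strings that begin with the word POPE
anchored <- sub("^(?!POPE\\b)(.*?)\\s[IVXLCM]+$", "\\1", x, perl = TRUE)

# Lookahead with .* : blocks strings containing the word POPE anywhere
anywhere <- sub("^(?!.*\\bPOPE\\b)(.*?)\\s[IVXLCM]+$", "\\1", x, perl = TRUE)

print(anchored)  # "POPE GREGORY XIII" "ANTIPOPE JOHN" "JOHN POPE"
print(anywhere)  # "POPE GREGORY XIII" "ANTIPOPE JOHN" "JOHN POPE II"
```

The `\\b` word boundaries are what keep `ANTIPOPE` from counting as `POPE` in both patterns.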

Sorry, I meant that we do want to remove them for `POPE`. I've changed my original question. — bltSandwich21, Oct 29 '20 at 20:17

score 0 · Answer 3 · answered Oct 29 '20 at 22:30

An option with str_remove

library(dplyr)
library(stringr)
employee_df %>%
  # Anchor at the end with $ so that only a trailing numeral is removed
  mutate(employee = str_remove(employee, "\\s+[IVXLCM]+$"))

akrun
