Replace elements in one vector with elements containing a similar pattern in another vector

Question

name_w_degree <- c("Julie (Dr)", "Helen (MD)", "Faye")

df <- data.frame(name = c("Julie", "Helen", "Faye", "Faye", "Helen"),
                 value = c(1, 2, 3, 4, 5))

I would like to replace "Julie" in df with the element of the vector name_w_degree that starts with "Julie" (i.e. Julie (Dr)). Similarly, replace "Helen" in df with the element starting with "Helen" from name_w_degree (i.e. Helen (MD)), and "Faye" in df with the element starting with "Faye" from name_w_degree. Keep the value column as is.

Is there a way to mutate the values under the name column taking advantage of the "starting with a corresponding name" pattern rather than hard code? Thanks in advance.

Expecting the mutated df to be:

name_w_degree	value
Julie (Dr)	1
Helen (MD)	2
Faye	3
Faye	4
Helen (MD)	5

Maybe have a look at [Is there a dictionary functionality in R](https://stackoverflow.com/questions/7818970). — GKi, Jul 12 '23 at 21:01

SamR · Answer 1 · 2023-07-12T09:02:56.690

It is quicker to do vectorised replacement than to apply a function to every row, particularly as the size of the data increases. As you want to match on the first word, you can use setNames() to create a named vector of patterns and replacements. You can then do vectorised replacement with stringr::str_replace_all():

df$name  <- stringr::str_replace_all(df$name, setNames(name_w_degree, gsub("\\s.+", "", name_w_degree)))
df
#         name value
# 1 Julie (Dr)     1
# 2 Helen (MD)     2
# 3       Faye     3
# 4       Faye     4
# 5 Helen (MD)     5

Benchmarks

It doesn't make much difference with small data frames, but as they grow this method becomes much quicker relative to the non-vectorised approaches. The performance is similar to the answer by @ThomasIsCoding until around 800k rows, at which point that approach becomes significantly faster than this one. Both are much faster than approaches which are not vectorised (e.g. using sapply() or map()), presumably because of the overhead of calling a function many times.

Benchmark code

library(dplyr) # %>%, rowwise(), mutate()
library(purrr) # map_chr()

n <- c(1, 10, 100, 1e3, 1e4, 1e5, 1e6)
results <- bench::press(
    n = n,
    {
        # replicate df n times
        big_df <- do.call(rbind, replicate(n, df, simplify = FALSE))

        bench::mark(
            min_iterations = 1,
            max_iterations = 100,
            check = FALSE,
            rowwise = {
                big_df %>%
                    rowwise() %>%
                    mutate(name = name_w_degree[grepl(name, name_w_degree)])
            },
            base_sapply = {
                sapply(big_df$name, function(x) {
                    name_w_degree[which(grepl(x, name_w_degree))]
                })
            },
            purrr_map_chr = {
                big_df %>%
                    mutate(name = map_chr(name, ~ grep(pattern = ., x = name_w_degree, value = TRUE)))
            },
            stringr_replace_all = {
                stringr::str_replace_all(big_df$name, setNames(name_w_degree, gsub("\\s.+", "", name_w_degree)))
            },
            base_transform = {
                transform(big_df, name = name_w_degree[match(name, sub("\\W+.*", "", name_w_degree))])
            }
        )
    }
)
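Because the benchmark sets check = FALSE, bench::mark() does not verify that the approaches return the same values. A minimal base R sketch of such a check for two of them, using the question's data (the variable names via_sapply and via_match are only for illustration):

```r
name_w_degree <- c("Julie (Dr)", "Helen (MD)", "Faye")
df <- data.frame(name = c("Julie", "Helen", "Faye", "Faye", "Helen"),
                 value = c(1, 2, 3, 4, 5), stringsAsFactors = FALSE)

# Non-vectorised: one grepl() call per row; unname() drops the names sapply() adds
via_sapply <- unname(sapply(df$name, function(x) name_w_degree[grepl(x, name_w_degree)]))

# Vectorised: a single match() against the first word of each full name
via_match <- name_w_degree[match(df$name, sub("\\W+.*", "", name_w_degree))]

identical(via_sapply, via_match)
```

On this data both vectors hold the same five strings, so the comparison should come out TRUE.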

Code to generate plot


library(dplyr) # transmute()
library(ggplot2)
results |>
    transmute(
        expression = attr(expression, "description"),
        n = n * 5,
        median
    ) |>
    ggplot(aes(x = n, y = median, group = expression)) +
    geom_line(aes(color = expression), linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
    geom_point(aes(color = expression), size = 2) +
    scale_x_log10(n.breaks = length(n)) +
    theme_bw() +
    theme(
        legend.position = "bottom"
    ) +
    labs(
        title = "Comparison of results",
        x = "Number of rows",
        y = "Median time to run (seconds)"
    )

I think your vectorized approach is fast enough, but `setNames` is slower and `gsub` is a bit less efficient than `sub`, so you can see some speed improvement if you use `match` + `sub`. Anyway, impressive benchmarking, +1! — ThomasIsCoding, Jul 12 '23 at 06:47

score 2 · Answer 2 · answered Jul 12 '23 at 06:36

With base R, try match + sub like below

> transform(df, name = name_w_degree[match(name, sub("\\W+.*", "", name_w_degree))])
        name value
1 Julie (Dr)     1
2 Helen (MD)     2
3       Faye     3
4       Faye     4
5 Helen (MD)     5

ThomasIsCoding

Nice! I've added it to my benchmarks. It's the fastest approach of the ones posted now. – SamR Jul 12 '23 at 06:43
What if there is no match? – GKi Jul 12 '23 at 09:55
@GKi Then we should take additional actions like `coalesce(..., name)` – ThomasIsCoding Jul 12 '23 at 12:00
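The fallback mentioned in the last comment can be sketched in base R with ifelse() instead of dplyr's coalesce(). The data frame df2 and the name "Anna" are hypothetical, added only to show a name with no entry in name_w_degree:

```r
name_w_degree <- c("Julie (Dr)", "Helen (MD)", "Faye")

# "Anna" is a hypothetical name that has no entry in name_w_degree
df2 <- data.frame(name = c("Julie", "Anna", "Faye"),
                  value = c(1, 2, 3), stringsAsFactors = FALSE)

# match() returns NA where the bare name is not found
idx <- match(df2$name, sub("\\W+.*", "", name_w_degree))

# Keep the original name wherever there was no match
df2$name <- ifelse(is.na(idx), df2$name, name_w_degree[idx])
df2$name
```

Here "Anna" is left unchanged, while "Julie" becomes "Julie (Dr)" and "Faye" stays "Faye".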

Park · Answer 3 · 2023-07-12T06:19:51.973

You may try

library(dplyr)

df %>%
  rowwise %>%
  mutate(name = name_w_degree[grepl(name, name_w_degree)])

  name       value
  <chr>      <dbl>
1 Julie (Dr)     1
2 Helen (MD)     2
3 Faye           3
4 Faye           4
5 Helen (MD)     5

or

df$name <- sapply(df$name, function(x) {name_w_degree[grepl(x, name_w_degree)]})
df

        name value
1 Julie (Dr)     1
2 Helen (MD)     2
3       Faye     3
4       Faye     4
5 Helen (MD)     5
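One caveat with grepl() here: the name is treated as a regular expression and may match anywhere in the string, so one name that is a prefix of another can produce several hits. A small sketch with a hypothetical lookup vector (name_w_degree2 and "Helena (Dr)" are not part of the question):

```r
# Hypothetical lookup vector in which "Helen" is a prefix of another name
name_w_degree2 <- c("Helen (MD)", "Helena (Dr)", "Faye")

# grepl() matches anywhere in the string, so "Helen" hits two entries
loose <- name_w_degree2[grepl("Helen", name_w_degree2)]

# Comparing against the first word only gives a single, exact match
first_word <- sub("\\W+.*", "", name_w_degree2)
exact <- name_w_degree2[first_word == "Helen"]

loose
exact
```

With more than one hit per row, sapply() would return a list rather than a character vector, so the exact comparison is the safer choice if such names can occur.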

score 1 · Answer 4 · answered Jul 12 '23 at 06:07

Another way:

library(tidyverse)

df %>% 
  mutate(name = map_chr(name, ~ grep(pattern = ., x = name_w_degree, value = TRUE)))

        name value
1 Julie (Dr)     1
2 Helen (MD)     2
3       Faye     3
4       Faye     4
5 Helen (MD)     5

Mark

Mark · Answer 5 · 2023-07-12T06:47:45.607

Benchmarking the answers so far:

name_w_degree <- c("Julie (Dr)", "Helen (MD)", "Faye")

df <- data.frame(name = c("Julie", "Helen", "Faye", "Faye", "Helen"),
                 value = c(1, 2, 3, 4, 5))

library(tidyverse)

# benchmark the above solutions
bench::mark(
  str_replace_all = stringr::str_replace_all(df$name, setNames(name_w_degree, gsub("\\s.+", "", name_w_degree))),
  map_chr = df %>% 
    mutate(name = map_chr(name, ~ grep(pattern = ., x = name_w_degree, value = TRUE))),
  rowwise = df %>%
    rowwise %>%
    mutate(name = name_w_degree[grepl(name, name_w_degree)]),
  sapply = sapply(df$name, function(x) {name_w_degree[grepl(x, name_w_degree)]}),
  match_and_sub = transform(df, name = name_w_degree[match(name, sub("\\W+.*", "", name_w_degree))]),
  check = FALSE
) %>%
  arrange(median) %>%
  print(width = Inf)


# A tibble: 5 × 13
  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 sapply           20.91µs   22.3µs    43514.    8.84KB     39.2  9991     9
2 match_and_sub    52.97µs   55.8µs    17398.   16.06KB     38.7  8095    18
3 str_replace_all  84.34µs     88µs    11161.  130.08KB     32.6  5133    15
4 map_chr         369.29µs  384.6µs     2548.    1.35MB     44.2  1154    20
5 rowwise           1.15ms    1.2ms      810.     1.2MB     40.7   358    18

tl;dr: on this 5-row data frame, rowwise is the slowest, and the fastest are sapply, match + sub, and SamR's str_replace_all (probably because they don't process the entire data frame as my map_chr answer does, but that is just a guess).

edited Jul 12 '23 at 06:47

answered Jul 12 '23 at 06:24

I edited my answer with some benchmarks as well - the difference becomes more pronounced as the size of the data increases. – SamR Jul 12 '23 at 06:29
I was just looking at that. It's very interesting! – Mark Jul 12 '23 at 06:29
I think as `n` approaches infinity the approaches that use `map()`, `rowwise()` or `sapply()` will ultimately more or less converge, as the overhead is calling a function many times, rather than the relatively simple operation written in C that's happening within all of our functions. – SamR Jul 12 '23 at 06:35
I just ran the code in my comparison answer, but with iterations = 100000, and I still get the lowest median by far from the sapply one. – Mark Jul 12 '23 at 06:39
Same order as before too! 22.02µs, 85.61µs, 373.31µs, 1.16ms – Mark Jul 12 '23 at 06:40
The memory allocation on the str_replace_all one is obviously a lot lower though, which is down to the C code, I guess. – Mark Jul 12 '23 at 06:40
That seems odd... can you post the entirety of what you ran? Btw, it might be better to edit your benchmarks into your answer than to post a second answer. – SamR Jul 12 '23 at 06:41
I'll post it now. I posted a second answer as that is its own thing, I figure. – Mark Jul 12 '23 at 06:44
@SamR updated it. It might be because I'm not updating the dataframe with the new column in the code. – Mark Jul 12 '23 at 06:48
I get the same results as you when the data frame has 5 rows: `sapply()` is fastest. It's only really when you get to 5,000 rows or more that you see a significant difference. – SamR Jul 12 '23 at 06:57
Oddly, sometimes when I run the benchmark it reports 0B of memory allocated for the match and sub one. I'm guessing that's because it's pure C. – Mark Jul 12 '23 at 06:58
I've had that issue before - see [here](https://stackoverflow.com/questions/74809614/why-is-outer-slower-than-a-for-loop-in-r/74810503#74810503). I don't know exactly why it happens but in my experience it can't distinguish between zero and very low memory usage. – SamR Jul 12 '23 at 07:02

GKi · Answer 6 · 2023-07-12T10:31:46.790

One possibility is to store the names as a factor and, using startsWith, change only those levels for which a match exists.
If there are, for example, both a Helen and a Helena, the names could be written with a trailing space ("Helen ", "Helena ") to distinguish them.

df$name <- as.factor(df$name)
levels(df$name) <- unlist(lapply(levels(df$name), \(s) {
       i <- match(TRUE, startsWith(name_w_degree, s))
       if(is.na(i)) s
       else name_w_degree[i] } ) )

df
#        name value
#1 Julie (Dr)     1
#2 Helen (MD)     2
#3       Faye     3
#4       Faye     4
#5 Helen (MD)     5
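The reason the factor approach scales well is that the lookup runs once per distinct name rather than once per row. A rough base R sketch of the same idea without factors, assuming the question's data (the names u, hit, u_new and result are only illustrative):

```r
name_w_degree <- c("Julie (Dr)", "Helen (MD)", "Faye")
name <- c("Julie", "Helen", "Faye", "Faye", "Helen")

# Work on the 3 distinct names instead of all 5 rows
u <- unique(name)

# Index of the first full name starting with each distinct name (NA if none)
hit <- vapply(u, function(s) match(TRUE, startsWith(name_w_degree, s)), integer(1))

# Keep the original name where nothing matched
u_new <- ifelse(is.na(hit), u, name_w_degree[hit])

# Map the per-name replacements back onto every row
result <- unname(u_new[match(name, u)])
result
```

The final match() plays the role that the factor's integer codes play in the answer above.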

Or a variant using sub, fastmatch and collapse.

library(fastmatch)
library(collapse)
tt <- qF(df$name)
i <- fmatch(levels(tt), sub(" .*", "", name_w_degree))
j <- which(!is.na(i))
levels(tt)[j] <- name_w_degree[i[j]]
levels(tt)[tt]
#[1] "Julie (Dr)" "Helen (MD)" "Faye"       "Faye"       "Helen (MD)"

Or just using fastmatch.

library(fastmatch)
i <- fmatch(df$name, sub(" .*", "", name_w_degree))
j <- which(!is.na(i))
`[<-`(df$name, j, name_w_degree[i[j]])
#[1] "Julie (Dr)" "Helen (MD)" "Faye"       "Faye"       "Helen (MD)"

Merijn van Tilborg · Answer 7 · 2023-07-12T10:07:49.813

I would create your lookup as a named vector. I assume your list of names is fairly small; otherwise you would need a better solution anyway, as there might be several Helens with different degrees. You could also consider a lookup table and a join. Having said that, here is how I would do it (credit for the sub code goes to @ThomasIsCoding).

Note: all names in the data must also be present in the degrees vector.

names(name_w_degree) <- sub("\\W+.*", "", name_w_degree)

df$name <- name_w_degree[df$name]

df

#         name value
# 1 Julie (Dr)     1
# 2 Helen (MD)     2
# 3       Faye     3
# 4       Faye     4
# 5 Helen (MD)     5

data

name_w_degree <- c("Julie (Dr)", "Helen (MD)", "Faye")

df <- data.frame(name = c("Julie", "Helen", "Faye", "Faye", "Helen"),
                 value = c(1, 2, 3, 4, 5))
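The lookup table and join alternative mentioned above might look roughly like this in base R. The helper column row_id and the names lookup and joined are assumptions for illustration; row_id is there because merge() does not preserve the original row order:

```r
name_w_degree <- c("Julie (Dr)", "Helen (MD)", "Faye")
df <- data.frame(name = c("Julie", "Helen", "Faye", "Faye", "Helen"),
                 value = c(1, 2, 3, 4, 5), stringsAsFactors = FALSE)

# Lookup table: bare first name -> full name with degree
lookup <- data.frame(name = sub("\\W+.*", "", name_w_degree),
                     name_w_degree = name_w_degree,
                     stringsAsFactors = FALSE)

# Remember the original order, since merge() may reorder rows
df$row_id <- seq_len(nrow(df))

# Left join: all.x = TRUE keeps rows without a match (they get NA)
joined <- merge(df, lookup, by = "name", all.x = TRUE)
joined <- joined[order(joined$row_id), c("name_w_degree", "value")]
rownames(joined) <- NULL
joined
```

Unlike indexing a named vector, the join makes unmatched names visible as NA rows that can be inspected or filled in afterwards.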

edited Jul 12 '23 at 10:07

answered Jul 12 '23 at 09:41

What if there is no match? – GKi Jul 12 '23 at 09:55
That is a good point: it can only be used when all names are defined (which is the case in the example, noting that "Faye" is present too). Like all the other solutions, it also does not cover multiple Helens, for example Helen (MD) and Helen (Dr). I have added a note to the answer that it only works if all names are known. – Merijn van Tilborg Jul 12 '23 at 10:06
