Create new column with dplyr mutate and substring of existing column

Question

I have a dataframe with a column of strings and want to extract substrings of those into a new column.

The sample data and code below show what I am after: I want to take the string after the final underscore character in the id column and use it to create a new_id column. Each id entry always has exactly 2 underscore characters, and it is always the final substring I would like.

df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )

require(dplyr)

df = df %>% dplyr::mutate(new_id = strsplit(id, split="_")[[1]][3])

I was expecting strsplit to act on each row in turn.

However, the new_id column only contains ABC in each row, whereas I would like ABC in row 1 and NHYK in row 2. Do you know why this fails and how to achieve what I want?

It's because of your call to `strsplit`. It returns a list with one element per row, and the `[[1]]` always grabs the first element of that list. – Lloyd Christmas, Feb 01 '17 at 18:26
In base R, it's as simple as a little regex magic: `df$newVar <- sub(".*_([A-Z]+)$", "\\1", df$id)`. – lmo, Feb 01 '17 at 19:41
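
To see what the first comment describes, it can help to look at what strsplit returns outside of mutate. The following is a minimal sketch in base R. It assumes a plain character vector instead of the I() wrapper from the question, and the names parts, recycled and per_row are illustrative only.

```r
id <- c("abcd_123_ABC", "abc_5234_NHYK")

# strsplit is vectorised: it returns a list holding one character vector per input string
parts <- strsplit(id, split = "_")
length(parts)

# [[1]] selects the pieces of the FIRST string only, so [3] yields a single value
recycled <- parts[[1]][3]
recycled

# a per-string result needs an operation applied to every element,
# for example the sub() call from the second comment
per_row <- sub(".*_([A-Z]+)$", "\\1", id)
per_row
```

Inside mutate, the single value in recycled is repeated for every row, which is why both rows show ABC.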

score 35 · Accepted Answer · answered Feb 01 '17 at 18:38

You could use stringr::str_extract:

library(dplyr)
library(stringr)

df %>%
  mutate(new_id = str_extract(id, "[^_]+$"))

#>              id x new_id
#> 1  abcd_123_ABC 1    ABC
#> 2 abc_5234_NHYK 2   NHYK

The regex says: match one or more (+) characters that aren't _ (the negated character class [^_]), followed by the end of the string ($).
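
As a rough illustration of how that pattern behaves on inputs that do not follow the two-underscore layout, here is a small sketch. Only the first string comes from the question; the other two are hypothetical inputs made up to probe the pattern.

```r
library(stringr)

# Only the first value is from the question; the other two are hypothetical edge cases
ids <- c("abcd_123_ABC", "no-underscore", "trailing_")

# "[^_]+$": a run of one or more non-underscore characters that reaches the end of the string
last_piece <- str_extract(ids, "[^_]+$")
last_piece
```

A string without any underscore is returned whole, and a string ending in an underscore has no match, so str_extract gives NA for it.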

score 29 · Answer 2 · answered Jun 08 '17 at 00:00

An alternative that avoids regex and stays in the tidyverse style is tidyr::separate(). Note that it removes the input column by default (pass remove = FALSE to keep it).

library(dplyr)
library(tidyr)

## using your example data
df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )

## separate, knowing you will have three components, then drop the first two
df %>%
  separate(id, c("first", "second", "new_id"), sep = "_") %>%
  select(-first, -second)
## returns
  new_id x
1    ABC 1
2   NHYK 2
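
If the original id column should be kept, a variation along the following lines may work. It is a sketch under two assumptions: the column is a plain character vector (not wrapped in I()), and the installed tidyr accepts NA entries in into, which the separate() documentation describes as a way to omit a piece from the output.

```r
library(tidyr)

# Plain character column; an assumption made for this sketch
df <- data.frame(id = c("abcd_123_ABC", "abc_5234_NHYK"),
                 x = c(1.0, 2.0),
                 stringsAsFactors = FALSE)

# NA in `into` drops the first two pieces; remove = FALSE keeps the id column
out <- separate(df, id, into = c(NA, NA, "new_id"), sep = "_", remove = FALSE)
out
```

This avoids the intermediate first and second columns, so no select() call is needed afterwards.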

score 12 · Answer 3 · answered Feb 01 '17 at 18:39

Use dplyr::rowwise:

df %>% dplyr::rowwise() %>% dplyr::mutate(new_id = strsplit(id, split="_")[[1]][3])

Further alternatives are discussed here:

http://www.expressivecode.org/2014/12/17/mutating-using-functions-in-dplyr/

Philipp Merkle

Note that this will be slower than typical `dplyr` as it can't benefit from vectorized operations. Still, +1 for the tip. – vincentmajor Jun 07 '17 at 21:20
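
For comparison, here is a sketch of the rowwise approach next to a variant that applies the indexing to every element of the strsplit result without rowwise. It assumes a plain character id column, and the names by_row and vectorized are illustrative. The ungroup() call is an addition so that later steps are not evaluated row by row.

```r
library(dplyr)

# Plain character column; an assumption made for this sketch
df <- data.frame(id = c("abcd_123_ABC", "abc_5234_NHYK"),
                 x = c(1.0, 2.0),
                 stringsAsFactors = FALSE)

# rowwise: mutate sees one id at a time, so [[1]][3] is correct for each row
by_row <- df %>%
  rowwise() %>%
  mutate(new_id = strsplit(id, split = "_")[[1]][3]) %>%
  ungroup()

# no rowwise: split the whole column once, then take the third piece of each element
vectorized <- df %>%
  mutate(new_id = vapply(strsplit(id, split = "_"),
                         function(p) p[3],
                         character(1)))

identical(by_row$new_id, vectorized$new_id)
```

Both give the same new_id values here; the second form calls strsplit once for the whole column rather than once per row.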

score 6 · Answer 4 · answered Apr 05 '21 at 10:32

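This can be done using stringr::str_split by setting the simplify argument to TRUE.

With simplify = TRUE, str_split returns a character matrix (one row per input string, one column per piece) instead of a list, which allows element selection by column index. In this case, where there are always two "_" characters, we can always take the third column.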

library(dplyr)
library(stringr)

# Create df
df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )

# Create new_id using dplyr::mutate and stringr::str_split
df <- df %>%
  mutate(new_id = str_split(id, "_", simplify = TRUE)[, 3])

See https://github.com/tidyverse/stringr/issues/265
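
To make the shape of that intermediate result visible, here is a small sketch outside of mutate. It uses a plain character vector for brevity (an assumption), and the names pieces and new_id are illustrative.

```r
library(stringr)

id <- c("abcd_123_ABC", "abc_5234_NHYK")

# simplify = TRUE gives a character matrix: one row per string, one column per piece
pieces <- str_split(id, "_", simplify = TRUE)
dim(pieces)

# the third column holds the piece after the second underscore for every row
new_id <- pieces[, 3]
new_id
```

Because the column is taken across all rows at once, the result has one value per input string and fits directly into mutate.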

score 1 · Answer 5 · answered Feb 01 '17 at 18:31

Here's one way to use strsplit in a general way to do what you're looking for.

library(dplyr)
df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )

# positions 3, 6, ... in the flattened split; one position per row, hence nrow(df)
temp <- seq(from = 3, by = 3, length.out = nrow(df))
dfn <- df %>% dplyr::mutate(new_id = unlist(strsplit(id, split = "_"))[temp])

> dfn
             id x new_id
1  abcd_123_ABC 1    ABC
2 abc_5234_NHYK 2   NHYK
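
The indexing trick is easier to follow with the intermediate values spelled out. The sketch below uses a plain character vector (an assumption), and the names flat and idx are illustrative.

```r
id <- c("abcd_123_ABC", "abc_5234_NHYK")

# unlist flattens the per-string pieces into one vector of 6 elements
flat <- unlist(strsplit(id, split = "_"))
flat

# every string contributes exactly 3 pieces, so the last piece of each sits at 3, 6, ...
idx <- seq(from = 3, by = 3, length.out = length(id))
idx

flat[idx]
```

This only works while every id splits into exactly three pieces. Also note that length() of a data frame counts its columns, so nrow(df) is the safer choice for length.out when working from the data frame itself.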
