using strsplit and subset in dplyr and mutate

Question

I have a data table with one string column. I'd like to create another column that is a subset of this column using strsplit.

dat <- data.table(labels=c('a_1','b_2','c_3','d_4'))

The output I want is

labels  sub_label
a_1     a
b_2     b
c_3     c
d_4     d

I've tried the following, but neither attempt seems to work.

dat %>%
    mutate(
        sub_labels=strsplit(as.character(labels), "_")[[1]][1]
    ) 
# gives a column whose values are all "a"

This one, which seems logical to me,

dat %>%
    mutate(
        sub_labels=sapply(strsplit(as.character(labels), "_"), function(x) x[[1]][1])
    )

gives an error:

Error: Don't know how to handle type pairlist

I saw another post where paste-collapse on the output from strsplit worked so I don't understand why subsetting in an anonymous function is giving issues. Thanks for any elucidation on this.
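A minimal base R sketch of why the first attempt yields "a" in every row: `strsplit` returns a list with one element per input string, so `[[1]][1]` picks the first piece of the first element only, and `mutate` then recycles that single value. The variable names below are illustrative and not from the original post.

```r
labels <- c("a_1", "b_2", "c_3", "d_4")

# strsplit() returns a list with one character vector per input string
pieces <- strsplit(labels, "_")

# [[1]] selects the first list element (the pieces of "a_1"), and [1] its first piece,
# so this is a single string regardless of how many labels there are
first_only <- pieces[[1]][1]

# Taking the first piece of every element gives one value per row
per_element <- vapply(pieces, function(x) x[1], character(1))
```

Here `first_only` is the length-one value that gets recycled down the column, while `per_element` has one entry per label.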

It's simpler to use regex or `substr`, as they return strings instead of a list: `dat %>% mutate(sub_label = sub('_.*', '', labels))`. Another option is `tidyr::separate` with `extra = 'drop'` and `remove = FALSE`: `dat %>% separate(labels, 'sub_label', extra = 'drop', remove = FALSE)` – alistaire, Mar 02 '17 at 20:51
Weird, I just ran your last code `dat %>% mutate(sub_labels=sapply(strsplit(as.character(labels), "_"), function(x) x[[1]][1]))` and it worked fine; I did not get an error. – Djork, Mar 02 '17 at 20:52
If you have a `data.table`, just do `dat[, c("first","second") := tstrsplit(labels,"_")]` – thelatemail, Mar 02 '17 at 23:41
Thanks @thelatemail. Inexplicably, the output doesn't get printed the first time around, even when assigned to an object (I assigned it to, say, x, and I have to print x twice to see the table), but it works great and is succinct. – chungkim271, Mar 03 '17 at 14:48
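The data.table suggestion above can be sketched as follows, keeping only the first piece. This assumes the `keep` argument of `tstrsplit`; the trailing `[]` is, as far as I know, the usual way to make the result print right after a `:=` assignment, which may be the printing quirk described in the last comment.

```r
library(data.table)

dat <- data.table(labels = c("a_1", "b_2", "c_3", "d_4"))

# keep = 1L retains only the first piece of each split;
# := adds the column by reference, and the trailing [] prints the updated table
dat[, sub_label := tstrsplit(labels, "_", fixed = TRUE, keep = 1L)][]
```

Because `:=` modifies `dat` in place, no reassignment is needed and the original `labels` column is retained.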

score 46 · Answer 1 · answered Mar 02 '17 at 20:58

tidyr::separate can help here:

> dat %>% separate(labels, c("first", "second") )
   first second
1:     a      1
2:     b      2
3:     c      3
4:     d      4

Romain Francois

Though this doesn't retain the original column, which I think is where the problem lies. – thelatemail Mar 02 '17 at 23:27
@thelatemail I think you can specify `remove = FALSE` in order to handle that. – David Arenburg Mar 05 '17 at 14:31
Awesome! I had no idea this function existed. I was trying to solve a similar problem to OP and this is exactly what I needed. Thanks! – Andrew Brēza May 20 '19 at 20:07
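The `remove = FALSE` suggestion from the comments can be sketched like this. A plain data.frame is used so that data.table is not required, and the column names `sub_label` and `value` are illustrative choices rather than anything from the answer.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(labels = c("a_1", "b_2", "c_3", "d_4"))

# remove = FALSE keeps the original labels column next to the new pieces
res <- dat %>%
  separate(labels, into = c("sub_label", "value"), sep = "_", remove = FALSE)
```

The result has the columns `labels`, `sub_label`, and `value`, which matches the layout the question asks for plus the numeric suffix.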

score 15 · Answer 2 · edited Jul 14 '21 at 19:22

Another method uses purrr's map_chr, which I've found useful for applications where I didn't want to bother with separating and uniting (e.g. using the results in a sprintf with other strings):

library(dplyr)  # provides %>%, mutate() and tibble()
library(purrr)  # provides map_chr()
tibble(labels=c('a_1','b_2','c_3','d_4')) %>% 
  mutate(sub_label = stringr::str_split(labels, "_") %>% map_chr(., 1))

This method can be substantially faster than separate in my experience, especially when you have longer datasets. separate barely beats map when I use 100 strings, but falls behind in most cases when I use 1000 (not sure what's up with that max).

> microbenchmark::microbenchmark(
+   d.filtered_reads %>% head(1000) %>% 
+     mutate(name = stringr::str_split(Header, " ") %>% map_chr(., 1)) %>% 
+     select(-Header),
+   d.filtered_reads %>% head(1000) %>% 
+     separate(Header, into = c("name","index"), sep = " ") %>% 
+     select(-"index")
+ )
Unit: milliseconds
                                                                                                                          expr
 d.filtered_reads %>% head(1000) %>% mutate(name = stringr::str_split(Header,      " ") %>% map_chr(., 1)) %>% select(-Header)
          d.filtered_reads %>% head(1000) %>% separate(Header, into = c("name",      "index"), sep = " ") %>% select(-"index")
      min       lq     mean   median       uq       max neval
 5.333891 5.817589 6.292954 5.935706 6.059031 41.530089   100
 7.517316 8.031325 8.399471 8.500359 8.647468  9.855612   100

It is worth mentioning that `strsplit` is replaced here by `str_split` from `stringr` (https://github.com/tidyverse/stringr). Also, this code works as an alternative: `dat %>% mutate(sub_label = sapply(str_split(labels, "_"), function(x) x[1]))` – Sebastian Müller, Nov 22 '19 at 14:55
Thank you! I've added `stringr::` to clarify that. I usually just show that I'm loading tidyverse, but I forgot to do that here, so it's an important clarification. – GenesRus, Nov 22 '19 at 23:39
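As a rough illustration of the sprintf use case this answer mentions, the extracted piece can be fed straight into another string function without a separate/unite round trip. The `"group_%s"` format and the `msg` column name are made up for this sketch.

```r
library(dplyr)
library(purrr)
library(stringr)

res <- tibble(labels = c("a_1", "b_2", "c_3", "d_4")) %>%
  # map_chr(x, 1) extracts the first piece of each split as a character vector,
  # which sprintf() then consumes element by element
  mutate(msg = sprintf("group_%s", map_chr(str_split(labels, "_"), 1)))
```

Since `map_chr` returns an ordinary character vector, it composes with any vectorised string function inside `mutate`.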

score 9 · Answer 3 · answered Nov 23 '21 at 18:06

I didn't come up with this; I just stumbled on this GitHub issue while looking for a solution, and I think it is simpler than many of the answers here, particularly because it avoids an extra map_chr() or a tmp_chunks column.

# I used data.frame since I don't have data.table installed
library(dplyr)
library(stringr)
dat <- data.frame(labels=c('a_1','b_2','c_3','d_4'))
# simplify = TRUE returns a character matrix; [, 1] takes its first column
dat %>% mutate(sub_label = str_split(labels, "_", simplify = TRUE)[, 1])
  labels sub_label
1    a_1         a
2    b_2         b
3    c_3         c
4    d_4         d

Hendy

The map_chr is of more use in the example I mentioned, where you're using it with some other function that ends up needing purrr's mapping functionality to get all the vectors to play nicely. :) No doubt this is the simplest solution and is probably the fastest if the goal is simply to extract the character. – GenesRus Aug 28 '23 at 05:23
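To see why the `[, 1]` indexing in this answer works, it may help to look at what `str_split` returns with `simplify = TRUE`: a character matrix with one row per input string and one column per piece. The names `m` and `first` are illustrative.

```r
library(stringr)

labels <- c("a_1", "b_2", "c_3", "d_4")

# With simplify = TRUE the result is a character matrix, not a list
m <- str_split(labels, "_", simplify = TRUE)

# Each column holds one piece position, so column 1 is the prefix of every label
first <- m[, 1]
```

The second column, `m[, 2]`, holds the numeric suffixes, so both pieces are available from a single split.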

score 5 · Answer 4 · edited Jan 25 '22 at 17:29

In case we want to extract several columns at once (without running the split again, of course) we can combine GenesRus's approach with a temporary column that we drop with negative select() further down the pipeline:

library(purrr)
library(dplyr)
library(tibble)
library(stringr)

tibble(labels=c('a_1','b_2','c_3','d_4')) %>% 
  mutate(tmp_chunks = stringr::str_split(labels, stringr::fixed("_"),  n = 2)) %>%
  mutate(sub_label = map_chr(tmp_chunks, 1),
         sub_value = map_chr(tmp_chunks, 2)) %>%
  select(-tmp_chunks)

As of 2020, this approach performs much better than separate().

For completeness, it is worth mentioning that map_chr can take a .default parameter (in case the separator is missing in some lines), and that one can also get rid of labels with negative select(), if desired.

edited Jan 25 '22 at 17:29

answered Oct 01 '20 at 19:47

DomQ

Where does fixed() come from? I can't find that function. – GenesRus Dec 07 '20 at 04:06
Good catch @GenesRus. [It's from stringr](https://www.rdocumentation.org/packages/stringr/versions/0.6.2/topics/fixed). Updating code excerpt. – DomQ Dec 08 '20 at 15:07
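The `.default` parameter mentioned in this answer can be sketched as follows. The third label, `"e"`, is a made-up input with no separator, and using `NA_character_` as the fallback is an assumption about what one would want in that case.

```r
library(dplyr)
library(purrr)
library(stringr)

res <- tibble(labels = c("a_1", "b_2", "e")) %>%
  mutate(tmp_chunks = str_split(labels, fixed("_"), n = 2)) %>%
  mutate(sub_label = map_chr(tmp_chunks, 1),
         # "e" splits into a single piece, so position 2 is missing;
         # .default supplies the value to use instead of raising an error
         sub_value = map_chr(tmp_chunks, 2, .default = NA_character_)) %>%
  select(-tmp_chunks)
```

Rows without a separator then carry a missing `sub_value` while the rest of the pipeline is unchanged.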
