331

I'd like to take data of the form

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
  attr          type
1    1   foo_and_bar
2   30 foo_and_bar_2
3    4   foo_and_bar
4    6 foo_and_bar_2

and use split() on the column "type" from above to get something like this:

  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

I came up with something unbelievably complex involving some form of apply that worked, but I've since misplaced that. It seemed far too complicated to be the best way. I can use strsplit as below, but then unclear how to get that back into 2 columns in the data frame.

> strsplit(as.character(before$type),'_and_')
[[1]]
[1] "foo" "bar"

[[2]]
[1] "foo"   "bar_2"

[[3]]
[1] "foo" "bar"

[[4]]
[1] "foo"   "bar_2"

Thanks for any pointers. I've not quite groked R lists just yet.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
jkebinger
  • 3,944
  • 4
  • 19
  • 14

18 Answers18

353

Use stringr::str_split_fixed

library(stringr)
str_split_fixed(before$type, "_and_", 2)
hadley
  • 102,019
  • 32
  • 183
  • 245
  • 4
    this worked pretty fine for my problem today as well.. but it was adding a 'c' at the beginning of each row. Any idea why is that??? `left_right <- str_split_fixed(as.character(split_df),'\">',2)` – LearneR Jul 28 '15 at 06:53
  • I would like to split with a pattern that has "...", when I apply that function, it returns nothing. What could be the problem. my type is something like "test...score" – user3841581 Mar 14 '16 at 08:15
  • 5
    @user3841581 - old query of yours I know, but this is covered in the documentation - `str_split_fixed("aaa...bbb", fixed("..."), 2)` works fine with `fixed()` to "Match a fixed string" in the `pattern=` argument. `.` means 'any character' in regex. – thelatemail Aug 09 '17 at 04:30
  • Thanks hadley, very convinient method, but there is one thing can be improved, if there is NA in the original column, after separation it will become sevaral empty string in result columns, which is unwanted, I want to keep the NA still NA after separation – cloudscomputes Sep 15 '17 at 03:28
  • Works well i.e. if the separator is missing ! i.e. if I have a vector 'a<-c("1N", "2N")' that I would like to separate in columns '1,1, "N", "N"' I run 'str_split_fixed(s, "", 2)'. I am just not sure how to name my new columns in this approach, 'col1<-c(1,1)' and 'col2<-c("N", "N")' – maycca May 22 '18 at 19:32
  • This wasn't directly the question but you can use the following `|` to use several strings that you want to search for. For instance, for `_and_` and `_or_`, the code would be: `str_split_fixed(before$type, c("_and_|_or_"), 2)` – Martin Sep 22 '21 at 12:22
267

Another option is to use the new tidyr package.

library(dplyr)
library(tidyr)

before <- data.frame(
  attr = c(1, 30 ,4 ,6 ), 
  type = c('foo_and_bar', 'foo_and_bar_2')
)

before %>%
  separate(type, c("foo", "bar"), "_and_")

##   attr foo   bar
## 1    1 foo   bar
## 2   30 foo bar_2
## 3    4 foo   bar
## 4    6 foo bar_2
hadley
  • 102,019
  • 32
  • 183
  • 245
  • 4
    Is there a way to limit number of splits with separate? Let's say I want to split on '_' only once (or do it with `str_split_fixed` and adding columns to existing dataframe)? – JelenaČuklina Jan 11 '16 at 11:42
  • @hadley How about if I want to split based on second `_`? I want the values as `foo_and`, `bar`/`bar_2`? – Prradep Nov 08 '21 at 13:07
  • 1
    `tidyr::separate` has been superseded by `tidyr::separate_wider_delim`. – robertspierre Jul 11 '23 at 20:32
92

5 years later adding the obligatory data.table solution

library(data.table) ## v 1.9.6+ 
setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_")]
before
#    attr          type type1 type2
# 1:    1   foo_and_bar   foo   bar
# 2:   30 foo_and_bar_2   foo bar_2
# 3:    4   foo_and_bar   foo   bar
# 4:    6 foo_and_bar_2   foo bar_2

We could also both make sure that the resulting columns will have correct types and improve performance by adding type.convert and fixed arguments (since "_and_" isn't really a regex)

setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_", type.convert = TRUE, fixed = TRUE)]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    if the number of your `'_and_'` patterns vary, you can find out the maximum number of matches (i.e. future columns) with `max(lengths(strsplit(before$type, '_and_')))` – andschar Jun 04 '19 at 10:41
  • This is my favorite answer, works very well! Could you please explain how it works. Why transpose(strsplit(…)) and isn't paste0 for concatenating strings - not splitting them... – Gecko May 14 '20 at 11:00
  • 3
    @Gecko I'm not sure what is the question. If you just use `strsplit` it creates a single vector with 2 values in each slot, so `tstrsplit` transposes it into 2 vectors with a single value in each. `paste0` is just used in order to create the column names, it is not used on the values. On the LHS of the equation are the column names, on the RHS is the split + transpose operation on the column. `:=` stands for "*assign in place*", hence you don't see the `<-` assignment operator there. – David Arenburg May 14 '20 at 11:46
73

Yet another approach: use rbind on out:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))  
out <- strsplit(as.character(before$type),'_and_') 
do.call(rbind, out)

     [,1]  [,2]   
[1,] "foo" "bar"  
[2,] "foo" "bar_2"
[3,] "foo" "bar"  
[4,] "foo" "bar_2"

And to combine:

data.frame(before$attr, do.call(rbind, out))
Aniko
  • 18,516
  • 4
  • 48
  • 45
  • 7
    Another alternative on newer R versions is `strcapture("(.*)_and_(.*)", as.character(before$type), data.frame(type_1 = "", type_2 = ""))` – alexis_laz Nov 10 '16 at 18:23
  • This is the correct solution. Simple and doesn't require thrid-party packages. – Cole Apr 04 '23 at 01:55
45

Notice that sapply with "[" can be used to extract either the first or second items in those lists so:

before$type_1 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 1)
before$type_2 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 2)
before$type <- NULL

And here's a gsub method:

before$type_1 <- gsub("_and_.+$", "", before$type)
before$type_2 <- gsub("^.+_and_", "", before$type)
before$type <- NULL
Stedy
  • 7,359
  • 14
  • 57
  • 77
IRTFM
  • 258,963
  • 21
  • 364
  • 487
34

here is a one liner along the same lines as aniko's solution, but using hadley's stringr package:

do.call(rbind, str_split(before$type, '_and_'))
Ramnath
  • 54,439
  • 16
  • 125
  • 152
30

To add to the options, you could also use my splitstackshape::cSplit function like this:

library(splitstackshape)
cSplit(before, "type", "_and_")
#    attr type_1 type_2
# 1:    1    foo    bar
# 2:   30    foo  bar_2
# 3:    4    foo    bar
# 4:    6    foo  bar_2
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • 1
    3 years later - this option is working best for a similar problem I have - however the dataframe I am working with has 54 columns and I need to split all of them into two. Is there a way to do this using this method - short of typing out the above command 54 times? Many thanks, Nicki. – Nicki Aug 03 '17 at 13:21
  • @Nicki, Have you tried providing a vector of the column names or the column positions? That should do it.... – A5C1D2H2I1M1N2O1R2T1 Aug 04 '17 at 16:12
  • It wasnt just renaming the columns - I needed to literally split the columns as above effectively doubling the number of columns in my df. The below was what I used in the end: df2 <- cSplit(df1, splitCols = 1:54, "/") – Nicki Aug 07 '17 at 13:20
29

The subject is almost exhausted, I 'd like though to offer a solution to a slightly more general version where you don't know the number of output columns, a priori. So for example you have

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2', 'foo_and_bar_2_and_bar_3', 'foo_and_bar'))
  attr                    type
1    1             foo_and_bar
2   30           foo_and_bar_2
3    4 foo_and_bar_2_and_bar_3
4    6             foo_and_bar

We can't use dplyr separate() because we don't know the number of the result columns before the split, so I have then created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns. I hope the coding patterns used, are correct.

split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Sub out the ""'s returned by filling the matrix to the right, with NAs which are useful
  cols[which(cols == "")] <- NA
  cols <- as.tibble(cols)
  # name the 'cols' tibble as 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m' 
  # where m = # columns of 'cols'
  m <- dim(cols)[2]

  names(cols) <- paste(into_prefix, 1:m, sep = "_")
  return(cols)
}

We can then use split_into_multiple in a dplyr pipe as follows:

after <- before %>% 
  bind_cols(split_into_multiple(.$type, "_and_", "type")) %>% 
  # selecting those that start with 'type_' will remove the original 'type' column
  select(attr, starts_with("type_"))

>after
  attr type_1 type_2 type_3
1    1    foo    bar   <NA>
2   30    foo  bar_2   <NA>
3    4    foo  bar_2  bar_3
4    6    foo    bar   <NA>

And then we can use gather to tidy up...

after %>% 
  gather(key, val, -attr, na.rm = T)

   attr    key   val
1     1 type_1   foo
2    30 type_1   foo
3     4 type_1   foo
4     6 type_1   foo
5     1 type_2   bar
6    30 type_2 bar_2
7     4 type_2 bar_2
8     6 type_2   bar
11    4 type_3 bar_3
Yannis P.
  • 2,745
  • 1
  • 24
  • 39
  • 1
    This is very useful. Years later, I am wondering if it is possible to introduce colnames that can be inserted in a for loop. For example, I want to split_into_multiple 10 columns (or more) and I don't want to split them each column at a time. I want the resulting split columns to bind together. I can't find the way to programatically select the column names (it gives me error) the option !!as.name() says there's not such attribute... How would you do it? – MEC Feb 21 '23 at 19:56
  • I am glad you still find it useful @MEC. My R knowledge has abandoned me and don't know how to hint you with this part. I think Python `pandas` might offer something but needs some homework... – Yannis P. Mar 18 '23 at 17:46
17

An easy way is to use sapply() and the [ function:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
out <- strsplit(as.character(before$type),'_and_')

For example:

> data.frame(t(sapply(out, `[`)))
   X1    X2
1 foo   bar
2 foo bar_2
3 foo   bar
4 foo bar_2

sapply()'s result is a matrix and needs transposing and casting back to a data frame. It is then some simple manipulations that yield the result you wanted:

after <- with(before, data.frame(attr = attr))
after <- cbind(after, data.frame(t(sapply(out, `[`))))
names(after)[2:3] <- paste("type", 1:2, sep = "_")

At this point, after is what you wanted

> after
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
10

Since R version 3.4.0 you can use strcapture() from the utils package (included with base R installs), binding the output onto the other column(s).

out <- strcapture(
    "(.*)_and_(.*)",
    as.character(before$type),
    data.frame(type_1 = character(), type_2 = character())
)

cbind(before["attr"], out)
#   attr type_1 type_2
# 1    1    foo    bar
# 2   30    foo  bar_2
# 3    4    foo    bar
# 4    6    foo  bar_2
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
9

Here is a base R one liner that overlaps a number of previous solutions, but returns a data.frame with the proper names.

out <- setNames(data.frame(before$attr,
                  do.call(rbind, strsplit(as.character(before$type),
                                          split="_and_"))),
                  c("attr", paste0("type_", 1:2)))
out
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

It uses strsplit to break up the variable, and data.frame with do.call/rbind to put the data back into a data.frame. The additional incremental improvement is the use of setNames to add variable names to the data.frame.

lmo
  • 37,904
  • 9
  • 56
  • 69
7

This question is pretty old but I'll add the solution I found the be the simplest at present.

library(reshape2)
before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
newColNames <- c("type1", "type2")
newCols <- colsplit(before$type, "_and_", newColNames)
after <- cbind(before, newCols)
after$type <- NULL
after
Swifty McSwifterton
  • 2,637
  • 1
  • 30
  • 37
7

base but probably slow:

n <- 1
for(i in strsplit(as.character(before$type),'_and_')){
     before[n, 'type_1'] <- i[[1]]
     before[n, 'type_2'] <- i[[2]]
     n <- n + 1
}

##   attr          type type_1 type_2
## 1    1   foo_and_bar    foo    bar
## 2   30 foo_and_bar_2    foo  bar_2
## 3    4   foo_and_bar    foo    bar
## 4    6 foo_and_bar_2    foo  bar_2
jpmorris
  • 882
  • 2
  • 10
  • 22
4

Another approach if you want to stick with strsplit() is to use the unlist() command. Here's a solution along those lines.

tmp <- matrix(unlist(strsplit(as.character(before$type), '_and_')), ncol=2,
   byrow=TRUE)
after <- cbind(before$attr, as.data.frame(tmp))
names(after) <- c("attr", "type_1", "type_2")
ashaw
  • 198
  • 5
1

Here is another base R solution. We can use read.table but since it accepts only one-byte sep argument and here we have multi-byte separator we can use gsub to replace the multibyte separator to any one-byte separator and use that as sep argument in read.table

cbind(before[1], read.table(text = gsub('_and_', '\t', before$type), 
                 sep = "\t", col.names = paste0("type_", 1:2)))

#  attr type_1 type_2
#1    1    foo    bar
#2   30    foo  bar_2
#3    4    foo    bar
#4    6    foo  bar_2

In this case, we can also make it shorter by replacing it with default sep argument so we don't have to mention it explicitly

cbind(before[1], read.table(text = gsub('_and_', ' ', before$type), 
                 col.names = paste0("type_", 1:2)))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Surprisingly, another tidyverse solution is still missing - you can also use tidyr::extract, with a regex.

library(tidyr)
before <- data.frame(attr = c(1, 30, 4, 6), type = c("foo_and_bar", "foo_and_bar_2"))

## regex - getting all characters except an underscore till the first underscore, 
## inspired by Akrun https://stackoverflow.com/a/49752920/7941188 

extract(before, col = type, into = paste0("type", 1:2), regex = "(^[^_]*)_(.*)")
#>   attr type1     type2
#> 1    1   foo   and_bar
#> 2   30   foo and_bar_2
#> 3    4   foo   and_bar
#> 4    6   foo and_bar_2
tjebo
  • 21,977
  • 7
  • 58
  • 94
0

Another base R solution that also is a general way to split a column in several columns is:

Data

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))

Procedure

attach(before)
before$type2 <- gsub("(\\w*)_and_(\\w*)", "c('\\1', '\\2')", type)
#this recode the column type to c("blah", "blah") form

cbind(before,t(sapply(1:nrow(before), function(x) eval(parse(text=before$type2[x])))))
#this split the desired column into several ones named 1 2 3 and so on

OUTPUT

  attr          type             type2   1     2
1    1   foo_and_bar   c('foo', 'bar') foo   bar
2   30 foo_and_bar_2 c('foo', 'bar_2') foo bar_2
3    4   foo_and_bar   c('foo', 'bar') foo   bar
4    6 foo_and_bar_2 c('foo', 'bar_2') foo bar_2
Alan Gómez
  • 211
  • 1
  • 6
0

Since this question was asked separate has been superseded by separate_longer_* and separate_wider_* functions.

The way to do it now is:

library(tidyr)
separate_wider_delim(before, type, delim = "_and_", names_sep = "_")

You could also use separate_wider_regex, but I'll leave that as an exercise to the reader :-)

Mark
  • 7,785
  • 2
  • 14
  • 34