Split data frame string column into multiple columns

Question

I'd like to take data of the form

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
  attr          type
1    1   foo_and_bar
2   30 foo_and_bar_2
3    4   foo_and_bar
4    6 foo_and_bar_2

and use split() on the column "type" from above to get something like this:

  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

I came up with something unbelievably complex involving some form of apply that worked, but I've since misplaced that. It seemed far too complicated to be the best way. I can use strsplit as below, but then unclear how to get that back into 2 columns in the data frame.

> strsplit(as.character(before$type),'_and_')
[[1]]
[1] "foo" "bar"

[[2]]
[1] "foo"   "bar_2"

[[3]]
[1] "foo" "bar"

[[4]]
[1] "foo"   "bar_2"

Thanks for any pointers. I've not quite groked R lists just yet.

score 353 · Accepted Answer · answered Dec 04 '10 at 04:21

353

Use stringr::str_split_fixed

library(stringr)
str_split_fixed(before$type, "_and_", 2)

answered Dec 04 '10 at 04:21

hadley

102,019
32
183
245

4

this worked pretty fine for my problem today as well.. but it was adding a 'c' at the beginning of each row. Any idea why is that??? `left_right <- str_split_fixed(as.character(split_df),'\">',2)` – LearneR Jul 28 '15 at 06:53
I would like to split with a pattern that has "...", when I apply that function, it returns nothing. What could be the problem. my type is something like "test...score" – user3841581 Mar 14 '16 at 08:15
5

@user3841581 - old query of yours I know, but this is covered in the documentation - `str_split_fixed("aaa...bbb", fixed("..."), 2)` works fine with `fixed()` to "Match a fixed string" in the `pattern=` argument. `.` means 'any character' in regex. – thelatemail Aug 09 '17 at 04:30
Thanks hadley, very convinient method, but there is one thing can be improved, if there is NA in the original column, after separation it will become sevaral empty string in result columns, which is unwanted, I want to keep the NA still NA after separation – cloudscomputes Sep 15 '17 at 03:28
Works well i.e. if the separator is missing ! i.e. if I have a vector 'a<-c("1N", "2N")' that I would like to separate in columns '1,1, "N", "N"' I run 'str_split_fixed(s, "", 2)'. I am just not sure how to name my new columns in this approach, 'col1<-c(1,1)' and 'col2<-c("N", "N")' – maycca May 22 '18 at 19:32
This wasn't directly the question but you can use the following `|` to use several strings that you want to search for. For instance, for `_and_` and `_or_`, the code would be: `str_split_fixed(before$type, c("_and_|_or_"), 2)` – Martin Sep 22 '21 at 12:22

score 267 · Answer 2 · answered Jun 11 '14 at 16:50

267

Another option is to use the new tidyr package.

library(dplyr)
library(tidyr)

before <- data.frame(
  attr = c(1, 30 ,4 ,6 ), 
  type = c('foo_and_bar', 'foo_and_bar_2')
)

before %>%
  separate(type, c("foo", "bar"), "_and_")

##   attr foo   bar
## 1    1 foo   bar
## 2   30 foo bar_2
## 3    4 foo   bar
## 4    6 foo bar_2

answered Jun 11 '14 at 16:50

hadley

102,019
32
183
245

4

Is there a way to limit number of splits with separate? Let's say I want to split on '_' only once (or do it with `str_split_fixed` and adding columns to existing dataframe)? – JelenaČuklina Jan 11 '16 at 11:42
@hadley How about if I want to split based on second `_`? I want the values as `foo_and`, `bar`/`bar_2`? – Prradep Nov 08 '21 at 13:07
1

`tidyr::separate` has been superseded by `tidyr::separate_wider_delim`. – robertspierre Jul 11 '23 at 20:32

David Arenburg · Answer 3 · 2016-08-22T07:47:42.003

92

5 years later adding the obligatory data.table solution

library(data.table) ## v 1.9.6+ 
setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_")]
before
#    attr          type type1 type2
# 1:    1   foo_and_bar   foo   bar
# 2:   30 foo_and_bar_2   foo bar_2
# 3:    4   foo_and_bar   foo   bar
# 4:    6 foo_and_bar_2   foo bar_2

We could also both make sure that the resulting columns will have correct types and improve performance by adding type.convert and fixed arguments (since "_and_" isn't really a regex)

setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_", type.convert = TRUE, fixed = TRUE)]

edited Aug 22 '16 at 07:47

answered Oct 14 '15 at 14:14

David Arenburg

91,361
17
137
196

1

if the number of your `'_and_'` patterns vary, you can find out the maximum number of matches (i.e. future columns) with `max(lengths(strsplit(before$type, '_and_')))` – andschar Jun 04 '19 at 10:41
This is my favorite answer, works very well! Could you please explain how it works. Why transpose(strsplit(…)) and isn't paste0 for concatenating strings - not splitting them... – Gecko May 14 '20 at 11:00
3

@Gecko I'm not sure what is the question. If you just use `strsplit` it creates a single vector with 2 values in each slot, so `tstrsplit` transposes it into 2 vectors with a single value in each. `paste0` is just used in order to create the column names, it is not used on the values. On the LHS of the equation are the column names, on the RHS is the split + transpose operation on the column. `:=` stands for "*assign in place*", hence you don't see the `<-` assignment operator there. – David Arenburg May 14 '20 at 11:46

score 73 · Answer 4 · answered Dec 04 '10 at 00:51

73

Yet another approach: use rbind on out:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))  
out <- strsplit(as.character(before$type),'_and_') 
do.call(rbind, out)

     [,1]  [,2]   
[1,] "foo" "bar"  
[2,] "foo" "bar_2"
[3,] "foo" "bar"  
[4,] "foo" "bar_2"

And to combine:

data.frame(before$attr, do.call(rbind, out))

answered Dec 04 '10 at 00:51

Aniko

18,516
4
48
45

7

Another alternative on newer R versions is `strcapture("(.*)_and_(.*)", as.character(before$type), data.frame(type_1 = "", type_2 = ""))` – alexis_laz Nov 10 '16 at 18:23
This is the correct solution. Simple and doesn't require thrid-party packages. – Cole Apr 04 '23 at 01:55

score 45 · Answer 5 · edited Sep 02 '16 at 05:20

Notice that sapply with "[" can be used to extract either the first or second items in those lists so:

before$type_1 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 1)
before$type_2 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 2)
before$type <- NULL

And here's a gsub method:

before$type_1 <- gsub("_and_.+$", "", before$type)
before$type_2 <- gsub("^.+_and_", "", before$type)
before$type <- NULL

score 34 · Answer 6 · answered Dec 04 '10 at 02:09

34

here is a one liner along the same lines as aniko's solution, but using hadley's stringr package:

do.call(rbind, str_split(before$type, '_and_'))

answered Dec 04 '10 at 02:09

Ramnath

54,439
16
125
152

1

Good catch, best solution for me. Though a bit slower than with the `stringr` package. – Melka Mar 30 '16 at 11:34
did this function get renamed to `strsplit()` ? – user5359531 Feb 12 '21 at 13:58

score 30 · Answer 7 · edited Jul 12 '17 at 12:34

30

To add to the options, you could also use my splitstackshape::cSplit function like this:

library(splitstackshape)
cSplit(before, "type", "_and_")
#    attr type_1 type_2
# 1:    1    foo    bar
# 2:   30    foo  bar_2
# 3:    4    foo    bar
# 4:    6    foo  bar_2

edited Jul 12 '17 at 12:34

David Arenburg

91,361
17
137
196

answered Sep 27 '14 at 15:46

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

1

3 years later - this option is working best for a similar problem I have - however the dataframe I am working with has 54 columns and I need to split all of them into two. Is there a way to do this using this method - short of typing out the above command 54 times? Many thanks, Nicki. – Nicki Aug 03 '17 at 13:21
@Nicki, Have you tried providing a vector of the column names or the column positions? That should do it.... – A5C1D2H2I1M1N2O1R2T1 Aug 04 '17 at 16:12
It wasnt just renaming the columns - I needed to literally split the columns as above effectively doubling the number of columns in my df. The below was what I used in the end: df2 <- cSplit(df1, splitCols = 1:54, "/") – Nicki Aug 07 '17 at 13:20

score 29 · Answer 8 · answered Nov 01 '17 at 17:26

The subject is almost exhausted, I 'd like though to offer a solution to a slightly more general version where you don't know the number of output columns, a priori. So for example you have

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2', 'foo_and_bar_2_and_bar_3', 'foo_and_bar'))
  attr                    type
1    1             foo_and_bar
2   30           foo_and_bar_2
3    4 foo_and_bar_2_and_bar_3
4    6             foo_and_bar

We can't use dplyr separate() because we don't know the number of the result columns before the split, so I have then created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns. I hope the coding patterns used, are correct.

split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Sub out the ""'s returned by filling the matrix to the right, with NAs which are useful
  cols[which(cols == "")] <- NA
  cols <- as.tibble(cols)
  # name the 'cols' tibble as 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m' 
  # where m = # columns of 'cols'
  m <- dim(cols)[2]

  names(cols) <- paste(into_prefix, 1:m, sep = "_")
  return(cols)
}

We can then use split_into_multiple in a dplyr pipe as follows:

after <- before %>% 
  bind_cols(split_into_multiple(.$type, "_and_", "type")) %>% 
  # selecting those that start with 'type_' will remove the original 'type' column
  select(attr, starts_with("type_"))

>after
  attr type_1 type_2 type_3
1    1    foo    bar   <NA>
2   30    foo  bar_2   <NA>
3    4    foo  bar_2  bar_3
4    6    foo    bar   <NA>

And then we can use gather to tidy up...

after %>% 
  gather(key, val, -attr, na.rm = T)

   attr    key   val
1     1 type_1   foo
2    30 type_1   foo
3     4 type_1   foo
4     6 type_1   foo
5     1 type_2   bar
6    30 type_2 bar_2
7     4 type_2 bar_2
8     6 type_2   bar
11    4 type_3 bar_3

This is very useful. Years later, I am wondering if it is possible to introduce colnames that can be inserted in a for loop. For example, I want to split_into_multiple 10 columns (or more) and I don't want to split them each column at a time. I want the resulting split columns to bind together. I can't find the way to programatically select the column names (it gives me error) the option !!as.name() says there's not such attribute... How would you do it? — MEC, Feb 21 '23 at 19:56
I am glad you still find it useful @MEC. My R knowledge has abandoned me and don't know how to hint you with this part. I think Python `pandas` might offer something but needs some homework... — Yannis P., Mar 18 '23 at 17:46

score 17 · Answer 9 · answered Dec 03 '10 at 23:36

An easy way is to use sapply() and the [ function:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
out <- strsplit(as.character(before$type),'_and_')

For example:

> data.frame(t(sapply(out, `[`)))
   X1    X2
1 foo   bar
2 foo bar_2
3 foo   bar
4 foo bar_2

sapply()'s result is a matrix and needs transposing and casting back to a data frame. It is then some simple manipulations that yield the result you wanted:

after <- with(before, data.frame(attr = attr))
after <- cbind(after, data.frame(t(sapply(out, `[`))))
names(after)[2:3] <- paste("type", 1:2, sep = "_")

At this point, after is what you wanted

> after
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

Rich Scriven · Answer 10 · 2017-08-28T19:21:12.163

10

Since R version 3.4.0 you can use strcapture() from the utils package (included with base R installs), binding the output onto the other column(s).

out <- strcapture(
    "(.*)_and_(.*)",
    as.character(before$type),
    data.frame(type_1 = character(), type_2 = character())
)

cbind(before["attr"], out)
#   attr type_1 type_2
# 1    1    foo    bar
# 2   30    foo  bar_2
# 3    4    foo    bar
# 4    6    foo  bar_2

edited Aug 28 '17 at 19:21

answered Aug 28 '17 at 19:15

Rich Scriven

97,041
11
181
245

This is the most coherent 'built for purpose'. – Chris Dec 24 '22 at 02:40

score 9 · Answer 11 · answered Jul 22 '16 at 20:34

Here is a base R one liner that overlaps a number of previous solutions, but returns a data.frame with the proper names.

out <- setNames(data.frame(before$attr,
                  do.call(rbind, strsplit(as.character(before$type),
                                          split="_and_"))),
                  c("attr", paste0("type_", 1:2)))
out
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

It uses strsplit to break up the variable, and data.frame with do.call/rbind to put the data back into a data.frame. The additional incremental improvement is the use of setNames to add variable names to the data.frame.

score 7 · Answer 12 · answered Sep 28 '17 at 20:14

7

This question is pretty old but I'll add the solution I found the be the simplest at present.

library(reshape2)
before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
newColNames <- c("type1", "type2")
newCols <- colsplit(before$type, "_and_", newColNames)
after <- cbind(before, newCols)
after$type <- NULL
after

answered Sep 28 '17 at 20:14

Swifty McSwifterton

2,637
1
30
37

This is by far the easiest when it comes to managing df vectors – Apricot Jun 17 '19 at 04:38

score 7 · Answer 13 · answered Feb 17 '18 at 03:44

base but probably slow:

n <- 1
for(i in strsplit(as.character(before$type),'_and_')){
     before[n, 'type_1'] <- i[[1]]
     before[n, 'type_2'] <- i[[2]]
     n <- n + 1
}

##   attr          type type_1 type_2
## 1    1   foo_and_bar    foo    bar
## 2   30 foo_and_bar_2    foo  bar_2
## 3    4   foo_and_bar    foo    bar
## 4    6 foo_and_bar_2    foo  bar_2

score 4 · Answer 14 · answered Dec 03 '10 at 23:52

Another approach if you want to stick with strsplit() is to use the unlist() command. Here's a solution along those lines.

tmp <- matrix(unlist(strsplit(as.character(before$type), '_and_')), ncol=2,
   byrow=TRUE)
after <- cbind(before$attr, as.data.frame(tmp))
names(after) <- c("attr", "type_1", "type_2")

Ronak Shah · Answer 15 · 2020-01-04T02:28:37.787

Here is another base R solution. We can use read.table but since it accepts only one-byte sep argument and here we have multi-byte separator we can use gsub to replace the multibyte separator to any one-byte separator and use that as sep argument in read.table

cbind(before[1], read.table(text = gsub('_and_', '\t', before$type), 
                 sep = "\t", col.names = paste0("type_", 1:2)))

#  attr type_1 type_2
#1    1    foo    bar
#2   30    foo  bar_2
#3    4    foo    bar
#4    6    foo  bar_2

In this case, we can also make it shorter by replacing it with default sep argument so we don't have to mention it explicitly

cbind(before[1], read.table(text = gsub('_and_', ' ', before$type), 
                 col.names = paste0("type_", 1:2)))

score 1 · Answer 16 · answered Feb 09 '22 at 18:51

Surprisingly, another tidyverse solution is still missing - you can also use tidyr::extract, with a regex.

library(tidyr)
before <- data.frame(attr = c(1, 30, 4, 6), type = c("foo_and_bar", "foo_and_bar_2"))

## regex - getting all characters except an underscore till the first underscore, 
## inspired by Akrun https://stackoverflow.com/a/49752920/7941188 

extract(before, col = type, into = paste0("type", 1:2), regex = "(^[^_]*)_(.*)")
#>   attr type1     type2
#> 1    1   foo   and_bar
#> 2   30   foo and_bar_2
#> 3    4   foo   and_bar
#> 4    6   foo and_bar_2

score 0 · Answer 17 · answered Jul 14 '23 at 12:59

Another base R solution that also is a general way to split a column in several columns is:

Data

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))

Procedure

attach(before)
before$type2 <- gsub("(\\w*)_and_(\\w*)", "c('\\1', '\\2')", type)
#this recode the column type to c("blah", "blah") form

cbind(before,t(sapply(1:nrow(before), function(x) eval(parse(text=before$type2[x])))))
#this split the desired column into several ones named 1 2 3 and so on

OUTPUT

  attr          type             type2   1     2
1    1   foo_and_bar   c('foo', 'bar') foo   bar
2   30 foo_and_bar_2 c('foo', 'bar_2') foo bar_2
3    4   foo_and_bar   c('foo', 'bar') foo   bar
4    6 foo_and_bar_2 c('foo', 'bar_2') foo bar_2

Mark · Answer 18 · 2023-08-28T13:20:51.100

0

Since this question was asked separate has been superseded by separate_longer_* and separate_wider_* functions.

The way to do it now is:

library(tidyr)
separate_wider_delim(before, type, delim = "_and_", names_sep = "_")

You could also use separate_wider_regex, but I'll leave that as an exercise to the reader :-)

edited Aug 28 '23 at 13:20

answered Jul 21 '23 at 11:09

Mark

7,785
2
14
34

Split data frame string column into multiple columns

18 Answers18

Linked

Related