134

I would like to split one column into two within at data frame based on a delimiter. For example,

a|b
b|c

to become

a    b
b    c

within a data frame.

Thanks!

Jeff Erickson
  • 3,783
  • 8
  • 36
  • 43
  • Related: http://stackoverflow.com/questions/7033187/sets-in-r-dataframe/7033933#7033933 – Chase Aug 15 '11 at 21:31

6 Answers6

126

@Taesung Shin is right, but then just some more magic to make it into a data.frame. I added a "x|y" line to avoid ambiguities:

df <- data.frame(ID=11:13, FOO=c('a|b','b|c','x|y'))
foo <- data.frame(do.call('rbind', strsplit(as.character(df$FOO),'|',fixed=TRUE)))

Or, if you want to replace the columns in the existing data.frame:

within(df, FOO<-data.frame(do.call('rbind', strsplit(as.character(FOO), '|', fixed=TRUE))))

Which produces:

  ID FOO.X1 FOO.X2
1 11      a      b
2 12      b      c
3 13      x      y
Tommy
  • 39,997
  • 12
  • 90
  • 85
  • 1
    What would you do if it's one column within a pre-existing large dataframe with 100s of columns? – Jeff Erickson Aug 15 '11 at 19:04
  • good question. I would do the split, make it into a data frame, rename it appropriately (the `rename` function from the `reshape` package is handy for doing this on the fly) and then `rbind` it with the existing data frame -- extra effort to get it inserted in place of the previous single column rather than as the first or last columns ... – Ben Bolker Aug 15 '11 at 19:07
  • 1
    I my updated version handles several 100s of other columns too. – Tommy Aug 15 '11 at 19:12
  • +1 This is the most generic solution. It even can handle the situations where there are different numbers of values separated by '|'. Of course, one has to deal with the result of rbind which may recycle for those rows with fewer elements. But the other solution would fail for such case. – Yu Shen Apr 17 '15 at 04:47
  • 4
    Be careful, `length(strsplit('a|', '|', fixed=TRUE))` is `1` -- as [the docs](https://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html) say: "...but if there is a match at the end of the string, the output is the same as with the match removed." As @YuShen says, this solution will "recycle". For me, I just wanted empty spaces, not recycling. – Nate Anderson Oct 26 '15 at 20:27
  • This does the job, thanks! (Just bear in mind that it's quite slow, so I'm trying other options too) – Rodrigo Duarte Nov 09 '20 at 19:40
  • If the number of items between rows is different (e.g., a|b, b|c, x|y|z), is there a way to fill what should be empty with NA? With this code, the separated values are repeated rather than empty – ltong Jun 10 '22 at 15:39
100

The tidyr package has a helper function for this, separate_wider_delim() (which superseded separate() in tidyr version 1.3.0 (2023-01-24):

df <- data.frame(ID=11:13, FOO=c('a|b', 'b|c', 'x|y'))
library(tidyr)
separate_wider_delim(df, cols = FOO, delim = "|", names = c("left", "right"))
# # A tibble: 3 × 3
#      ID left  right
#   <int> <chr> <chr>
# 1    11 a     b    
# 2    12 b     c    
# 3    13 x     y  

For tidyr versions < 1.3.0:

df <- data.frame(ID=11:13, FOO=c('a|b', 'b|c', 'x|y'))
separate(data = df, col = FOO, into = c("left", "right"), sep = "\\|")

#   ID left right
# 1 11    a     b
# 2 12    b     c
# 3 13    x     y

though in this case the defaults are smart enough to work (it looks for non-alphanumeric characters to split on).

separate(data = df, col = FOO, into = c("left", "right"))
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
66

Hadley has a very elegant solution to do this inside data frames in his reshape package, using the function colsplit.

require(reshape)
> df <- data.frame(ID=11:13, FOO=c('a|b','b|c','x|y'))
> df
  ID FOO
1 11 a|b
2 12 b|c
3 13 x|y
> df = transform(df, FOO = colsplit(FOO, split = "\\|", names = c('a', 'b')))
> df
  ID FOO.a FOO.b
1 11     a     b
2 12     b     c
3 13     x     y
Community
  • 1
  • 1
Ramnath
  • 54,439
  • 16
  • 125
  • 152
  • 1
    Is there a way to get this without the resultant column names as "FOO.a" and "FOO.b" but just as "a" and "b" (easy enough to change, but just wondering...)? – Amyunimus Dec 09 '12 at 05:21
  • 19
    You could do the following `with(df, cbind(ID, colsplit(df$FOO, pattern = "\\|", names = c('a', 'b'))))`. Note that the parameter `split` has been renamed to `pattern` in `reshape2` which is a later version of the `reshape` package. – Ramnath Dec 09 '12 at 21:46
  • My edit is prevented at the moment... I would just update your answer to print df after each assignment... require(reshape) > df <- data.frame(ID=11:13, FOO=c('a|b','b|c','x|y')) > df ID FOO 1 11 a|b 2 12 b|c 3 13 x|y > df = transform(df, FOO = colsplit(FOO, split = "\\|", names = c('a', 'b'))) > df ID FOO.a FOO.b 1 11 a b 2 12 b c 3 13 x y – Nate Anderson Oct 26 '15 at 20:41
  • 6
    Also, is there a way to do it if you don't know the number of columns which would result? – Nate Anderson Oct 26 '15 at 20:42
  • 2
    Small problem with this solution. It splits the columns into a 'nested' dataframe, so that if you need to use the data for a plot using `ggplot2`, the column names are not recognized. I've found the `separate` function in `tidyr` more useful for this purpose. See answer by `@Gregor` – Alison Bennett Apr 20 '17 at 01:23
50

Just came across this question as it was linked in a recent question on SO.

Shameless plug of an answer: Use cSplit from my "splitstackshape" package:

df <- data.frame(ID=11:13, FOO=c('a|b','b|c','x|y'))
library(splitstackshape)
cSplit(df, "FOO", "|")
#   ID FOO_1 FOO_2
# 1 11     a     b
# 2 12     b     c
# 3 13     x     y

This particular function also handles splitting multiple columns, even if each column has a different delimiter:

df <- data.frame(ID=11:13, 
                 FOO=c('a|b','b|c','x|y'), 
                 BAR = c("A*B", "B*C", "C*D"))
cSplit(df, c("FOO", "BAR"), c("|", "*"))
#   ID FOO_1 FOO_2 BAR_1 BAR_2
# 1 11     a     b     A     B
# 2 12     b     c     B     C
# 3 13     x     y     C     D

Essentially, it's a fancy convenience wrapper for using read.table(text = some_character_vector, sep = some_sep) and binding that output to the original data.frame. In other words, another A base R approach could be:

df <- data.frame(ID=11:13, FOO=c('a|b','b|c','x|y'))
cbind(df, read.table(text = as.character(df$FOO), sep = "|"))
  ID FOO V1 V2
1 11 a|b  a  b
2 12 b|c  b  c
3 13 x|y  x  y
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • 1
    I particularly like that you don't have to define the column names the data is going 'into' – tospig Jul 31 '15 at 09:03
  • Agree with tospig. I also love that native method `read.table(c('a|b','c|d'), '|')` can do the trick if we're creating a new data.frame, even though SO talked about "within a [existing] data frame." – Nate Anderson Oct 26 '15 at 20:47
  • Downside of read.table, sep can only be one byte, and a hash separator is interpreted as a comment? `read.table(text=c('a#b'), sep='#')` produces only one column -- I expected two columns. – Nate Anderson Oct 26 '15 at 21:05
  • @TheRedPea, on a phone right now, but can't you get the result you want by also specifying the comment character in `read.table` as `""`? – A5C1D2H2I1M1N2O1R2T1 Oct 27 '15 at 00:08
  • @AnandaMahto you're absolutely correct; this produces 2 columns: `read.table(text=c('a#b'), sep='#', comment.char = '')` and as for the single-character delimiter, if I pass the c('a~~b') into gsub I can get the single-byte delimiter I need: `gsub('~~','~', c('a~~b'))` – Nate Anderson Oct 27 '15 at 04:15
12
strsplit(c('a|b','b|c'),'|',fixed=TRUE)
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
Tae-Sung Shin
  • 20,215
  • 33
  • 138
  • 240
5

Combining @Ramnath and @Tommy's answers allowed me to find an approach that works in base R for one or more columns.

Basic usage:

> df = data.frame(
+   id=1:3, foo=c('a|b','b|c','c|d'), 
+   bar=c('p|q', 'r|s', 's|t'), stringsAsFactors=F)
> transform(df, test=do.call(rbind, strsplit(foo, '|', fixed=TRUE)), stringsAsFactors=F)
  id foo bar test.1 test.2
1  1 a|b p|q      a      b
2  2 b|c r|s      b      c
3  3 c|d s|t      c      d

Multiple columns:

> transform(df, lapply(list(foo,bar),
+ function(x)do.call(rbind, strsplit(x, '|', fixed=TRUE))), stringsAsFactors=F)
  id foo bar X1 X2 X1.1 X2.1
1  1 a|b p|q  a  b    p    q
2  2 b|c r|s  b  c    r    s
3  3 c|d s|t  c  d    s    t

Better naming of multiple split columns:

> transform(df, lapply({l<-list(foo,bar);names(l)=c('foo','bar');l}, 
+                          function(x)do.call(rbind, strsplit(x, '|', fixed=TRUE))), stringsAsFactors=F)
  id foo bar foo.1 foo.2 bar.1 bar.2
1  1 a|b p|q     a     b     p     q
2  2 b|c r|s     b     c     r     s
3  3 c|d s|t     c     d     s     t
JS1204
  • 344
  • 2
  • 10