125

I have a bunch of columns in a dataframe which I want to paste together (seperated by "-") as follows:

data <- data.frame('a' = 1:3, 
                   'b' = c('a','b','c'), 
                   'c' = c('d', 'e', 'f'), 
                   'd' = c('g', 'h', 'i'))
i.e.     
     a   b   c  d  
     1   a   d   g  
     2   b   e   h  
     3   c   f   i  

Which I want to become:

a x  
1 a-d-g  
2 b-e-h  
3 c-f-i  

I could normally do this with:

within(data, x <- paste(b,c,d,sep='-'))

and then removing the old columns, but unfortunately I do not know the names of the columns specifically, only a collective name for all of the columns, e.g. I would know that cols <- c('b','c','d')

Does anyone know a way of doing this?

Jaap
  • 81,064
  • 34
  • 182
  • 193
user1165199
  • 6,351
  • 13
  • 44
  • 60

11 Answers11

124
# your starting data..
data <- data.frame('a' = 1:3, 'b' = c('a','b','c'), 'c' = c('d', 'e', 'f'), 'd' = c('g', 'h', 'i')) 

# columns to paste together
cols <- c( 'b' , 'c' , 'd' )

# create a new column `x` with the three columns collapsed together
data$x <- apply( data[ , cols ] , 1 , paste , collapse = "-" )

# remove the unnecessary columns
data <- data[ , !( names( data ) %in% cols ) ]
ShivaT
  • 3
  • 3
Anthony Damico
  • 5,779
  • 7
  • 46
  • 77
  • 9
    no need for apply here; paste is vectorised, and that's more efficient – baptiste Jan 28 '13 at 21:49
  • 2
    @baptiste ..possible without `do.call`? – Anthony Damico Jan 28 '13 at 23:11
  • 2
    sure, you could for example use `evil(parse(...))`, but I believe `do.call` is the right call here. – baptiste Jan 28 '13 at 23:35
  • Do.call here is the better technique; maintains the vectorization. – Clayton Stanley Jan 30 '13 at 06:30
  • 2
    hmm.. how would you pass the `collapse = "-"` through? to `paste`? – Anthony Damico May 13 '14 at 09:01
  • Funnily when I merge 2 columns (1 factorized and 1 numeric), where second column goes from 1 to 12 (2 digit!), I get a space between separator and second part e.g. "Blue_ 1", "Blue_ 2" instead of "Blue_1", "Blue_2". This problem doesn't occur with do.call. If the second column is only single digits, no space is introduced. – Saren Tasciyan May 13 '20 at 11:00
  • worked perfectly. it does need apply or some variant of it to concatenate several columns for which column names are unknown into a single column. [Other solutions](https://stackoverflow.com/questions/6984796/how-to-paste-a-string-on-each-element-of-a-vector-of-strings-using-apply-in-r) did not work for me. Also [lapply](https://stackoverflow.com/questions/29723493/apply-paste-over-a-list-of-vectors-to-get-a-list-of-strings) did not work. – Simone Nov 01 '21 at 16:12
  • @baptiste if you mean paste(data, collapse = "-" ) that would return a single string – M.L. Jun 03 '23 at 15:07
  • Could yo please explain what 1 means here? `apply( data[ , cols ] , 1 , paste , collapse = "-" )` – Rara Jul 06 '23 at 07:14
57

As a variant on baptiste's answer, with data defined as you have and the columns that you want to put together defined in cols

cols <- c("b", "c", "d")

You can add the new column to data and delete the old ones with

data$x <- do.call(paste, c(data[cols], sep="-"))
for (co in cols) data[co] <- NULL

which gives

> data
  a     x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i
Community
  • 1
  • 1
Brian Diggs
  • 57,757
  • 13
  • 166
  • 188
50

Using tidyr package, this can be easily handled in 1 function call.

data <- data.frame('a' = 1:3, 
                   'b' = c('a','b','c'), 
                   'c' = c('d', 'e', 'f'), 
                   'd' = c('g', 'h', 'i'))

tidyr::unite_(data, paste(colnames(data)[-1], collapse="_"), colnames(data)[-1])

  a b_c_d
1 1 a_d_g
2 2 b_e_h
3 3 c_f_i

Edit: Exclude first column, everything else gets pasted.

# tidyr_0.6.3

unite(data, newCol, -a) 
# or by column index unite(data, newCol, -1)

#   a newCol
# 1 1  a_d_g
# 2 2  b_e_h
# 3 3  c_f_i
zx8754
  • 52,746
  • 12
  • 114
  • 209
data_steve
  • 1,548
  • 12
  • 17
  • 4
    I think OP mentioned they don't know the column name in advance., otherwise the could do it with just `within(data, x <- paste(b,c,d,sep='-'))` as they illustrated. – David Arenburg Oct 08 '15 at 10:48
  • I agree with @DavidArenburg, this doesn't address the OP's situation. I think `unite_(data, "b_c_d", cols)` would, or depending on their actual data.frame, `unite(data, b_c_d, -a)` might be a candidate as well. – Sam Firke Oct 08 '15 at 12:33
15

I'd construct a new data.frame:

d <- data.frame('a' = 1:3, 'b' = c('a','b','c'), 'c' = c('d', 'e', 'f'), 'd' = c('g', 'h', 'i')) 

cols <- c( 'b' , 'c' , 'd' )

data.frame(a = d[, 'a'], x = do.call(paste, c(d[ , cols], list(sep = '-'))))
baptiste
  • 75,767
  • 19
  • 198
  • 294
  • note that instead of `d[ , cols]` you may want to use `d[ , names(d) != 'a']` if all but the `a` column are to be pasted together. – baptiste Jan 28 '13 at 18:39
  • 3
    One of the canonical solutions on SO, I think you could shorten this to `cbind(a = d['a'], x = do.call(paste, c(d[cols], sep = '-')))`, e.g. avoid the commas, `list` and `data.frame` while using the `data.frame` method of `cbind` – David Arenburg Oct 08 '15 at 12:05
10

Just to add additional solution with Reduce which probably is slower than do.call but probebly better than apply because it will avoid the matrix conversion. Also, instead a for loop we could just use setdiff in order to remove unwanted columns

cols <- c('b','c','d')
data$x <- Reduce(function(...) paste(..., sep = "-"), data[cols])
data[setdiff(names(data), cols)]
#   a     x
# 1 1 a-d-g
# 2 2 b-e-h
# 3 3 c-f-i

Alternatively we could update data in place using the data.table package (assuming fresh data)

library(data.table)
setDT(data)[, x := Reduce(function(...) paste(..., sep = "-"), .SD[, mget(cols)])]
data[, (cols) := NULL]
data
#    a     x
# 1: 1 a-d-g
# 2: 2 b-e-h
# 3: 3 c-f-i

Another option is to use .SDcols instead of mget as in

setDT(data)[, x := Reduce(function(...) paste(..., sep = "-"), .SD), .SDcols = cols]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
10

In my opinion the sprintf-function deserves a place among these answers as well. You can use sprintf as follows:

do.call(sprintf, c(d[cols], '%s-%s-%s'))

which gives:

 [1] "a-d-g" "b-e-h" "c-f-i"

And to create the required dataframe:

data.frame(a = d$a, x = do.call(sprintf, c(d[cols], '%s-%s-%s')))

giving:

  a     x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i

Although sprintf doesn't have a clear advantage over the do.call/paste combination of @BrianDiggs, it is especially usefull when you also want to pad certain parts of desired string or when you want to specify the number of digit. See ?sprintf for the several options.

Another variant would be to use pmap from :

pmap(d[2:4], paste, sep = '-')

Note: this pmap solution only works when the columns aren't factors.


A benchmark on a larger dataset:

# create a larger dataset
d2 <- d[sample(1:3,1e6,TRUE),]
# benchmark
library(microbenchmark)
microbenchmark(
  docp = do.call(paste, c(d2[cols], sep="-")),
  appl = apply( d2[, cols ] , 1 , paste , collapse = "-" ),
  tidr = tidyr::unite_(d2, "x", cols, sep="-")$x,
  docs = do.call(sprintf, c(d2[cols], '%s-%s-%s')),
  times=10)

results in:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval cld
 docp  214.1786  226.2835  297.1487  241.6150  409.2495  493.5036    10 a  
 appl 3832.3252 4048.9320 4131.6906 4072.4235 4255.1347 4486.9787    10   c
 tidr  206.9326  216.8619  275.4556  252.1381  318.4249  407.9816    10 a  
 docs  413.9073  443.1550  490.6520  453.1635  530.1318  659.8400    10  b 

Used data:

d <- data.frame(a = 1:3, b = c('a','b','c'), c = c('d','e','f'), d = c('g','h','i')) 
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • 1
    This is my preferred answer since sprintf allows (like paste) to insert a different string between each variables, not just one like "-" or "_", so having like sprintf, c(d2[cols], '%s (%s) some other text here %s ') – lokxs Dec 22 '22 at 11:28
8

I benchmarked the answers of Anthony Damico, Brian Diggs and data_steve on a small sample tbl_df and got the following results.

> data <- data.frame('a' = 1:3, 
+                    'b' = c('a','b','c'), 
+                    'c' = c('d', 'e', 'f'), 
+                    'd' = c('g', 'h', 'i'))
> data <- tbl_df(data)
> cols <- c("b", "c", "d")
> microbenchmark(
+     do.call(paste, c(data[cols], sep="-")),
+     apply( data[ , cols ] , 1 , paste , collapse = "-" ),
+     tidyr::unite_(data, "x", cols, sep="-")$x,
+     times=1000
+ )
Unit: microseconds
                                         expr     min      lq      mean  median       uq       max neval
do.call(paste, c(data[cols], sep = "-"))       65.248  78.380  93.90888  86.177  99.3090   436.220  1000
apply(data[, cols], 1, paste, collapse = "-") 223.239 263.044 313.11977 289.514 338.5520   743.583  1000
tidyr::unite_(data, "x", cols, sep = "-")$x   376.716 448.120 556.65424 501.877 606.9315 11537.846  1000

However, when I evaluated on my own tbl_df with ~1 million rows and 10 columns the results were quite different.

> microbenchmark(
+     do.call(paste, c(data[c("a", "b")], sep="-")),
+     apply( data[ , c("a", "b") ] , 1 , paste , collapse = "-" ),
+     tidyr::unite_(data, "c", c("a", "b"), sep="-")$c,
+     times=25
+ )
Unit: milliseconds
                                                       expr        min         lq      mean     median        uq       max neval
do.call(paste, c(data[c("a", "b")], sep="-"))                 930.7208   951.3048  1129.334   997.2744  1066.084  2169.147    25
apply( data[ , c("a", "b") ] , 1 , paste , collapse = "-" )  9368.2800 10948.0124 11678.393 11136.3756 11878.308 17587.617    25
tidyr::unite_(data, "c", c("a", "b"), sep="-")$c              968.5861  1008.4716  1095.886  1035.8348  1082.726  1759.349    25
ChristopherTull
  • 464
  • 1
  • 7
  • 11
6

Here's a fairly unconventional (but fast) approach: use fwrite from data.table to "paste" the columns together, and fread to read it back in. For convenience, I've written the steps as a function called fpaste:

fpaste <- function(dt, sep = ",") {
  x <- tempfile()
  fwrite(dt, file = x, sep = sep, col.names = FALSE)
  fread(x, sep = "\n", header = FALSE)
}

Here's an example:

d <- data.frame(a = 1:3, b = c('a','b','c'), c = c('d','e','f'), d = c('g','h','i')) 
cols = c("b", "c", "d")

fpaste(d[cols], "-")
#       V1
# 1: a-d-g
# 2: b-e-h
# 3: c-f-i

How does it perform?

d2 <- d[sample(1:3,1e6,TRUE),]
  
library(microbenchmark)
microbenchmark(
  docp = do.call(paste, c(d2[cols], sep="-")),
  tidr = tidyr::unite_(d2, "x", cols, sep="-")$x,
  docs = do.call(sprintf, c(d2[cols], '%s-%s-%s')),
  appl = apply( d2[, cols ] , 1 , paste , collapse = "-" ),
  fpaste = fpaste(d2[cols], "-")$V1,
  dt2 = as.data.table(d2)[, x := Reduce(function(...) paste(..., sep = "-"), .SD), .SDcols = cols][],
  times=10)
# Unit: milliseconds
#    expr        min         lq      mean     median         uq       max neval
#    docp  215.34536  217.22102  220.3603  221.44104  223.27224  225.0906    10
#    tidr  215.19907  215.81210  220.7131  220.09636  225.32717  229.6822    10
#    docs  281.16679  285.49786  289.4514  286.68738  290.17249  312.5484    10
#    appl 2816.61899 3106.19944 3259.3924 3266.45186 3401.80291 3804.7263    10
#  fpaste   88.57108   89.67795  101.1524   90.59217   91.76415  197.1555    10
#     dt2  301.95508  310.79082  384.8247  316.29807  383.94993  874.4472    10
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • What if you write and read to ramdisk? Comparison would be a little bit more fair. – jangorecki Jun 23 '20 at 21:44
  • @jangorecki, not sure if I'm doing it correctly (I started R with `TMPDIR=/dev/shm R`) but I don't notice a huge difference in comparison to these results. I also haven't played around at all with the number of threads used for `fread` or `fwrite` to see how it affects the results. – A5C1D2H2I1M1N2O1R2T1 Jun 24 '20 at 21:52
5

Simple and straightforward code with unite from {tidyr} v1.2.0

Solution with {tidyr v1.2.0}

library(tidyr)

data %>% unite("x", all_of(cols), remove = T, sep = "-")
  • "x" is the name of the new column.
  • all_of(cols) is a selection of what columns we want to merge. Using <tidy-select> the column names don't need to be hardcoded in.
  • remove = T we remove the input columns
  • sep = "-" we define the separator between values
  • if there is NA, we can also add na.rm = TRUE

Output

#   a     x
# 1 1 a-d-g
# 2 2 b-e-h
# 3 3 c-f-i

Input Data

data <- data.frame('a' = 1:3, 
                   'b' = c('a','b','c'), 
                   'c' = c('d', 'e', 'f'), 
                   'd' = c('g', 'h', 'i'))
cols <- c('b','c','d')
data

#   a b c d
# 1 1 a d g
# 2 2 b e h
# 3 3 c f i

*This solution is different from what already posted.

Brian Diggs
  • 57,757
  • 13
  • 166
  • 188
Ruam Pimentel
  • 1,288
  • 4
  • 16
2

I know this is an old question, but thought that I should anyway present the simple solution using the paste() function as suggested to by the questioner:

data_1<-data.frame(a=data$a,"x"=paste(data$b,data$c,data$d,sep="-")) 
data_1
  a     x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i
0
library(plyr)

ldply(apply(data, 1, function(x) data.frame(
                      x = paste(x[2:4],sep="",collapse="-"))))

#      x
#1 a-d-g
#2 b-e-h
#3 c-f-i

#  and with just the vector of names you have:

ldply(apply(data, 1, function(x) data.frame(
                      x = paste(x[c('b','c','d')],sep="",collapse="-"))))

# or equally:
mynames <-c('b','c','d')
ldply(apply(data, 1, function(x) data.frame(
                      x = paste(x[mynames],sep="",collapse="-"))))    
user1317221_G
  • 15,087
  • 3
  • 52
  • 78