Concatenate vector elements in groups

Question

What is the most elegant way of converting list1 to list2, and also list2 to list1?

list1<- c('a','b','c','d','e','f','g','h','i')
list2<- c('abc','def','ghi')

i.e., concatenate elements in groups of three.

thanks :D

score 3 · Accepted Answer · edited May 23 '17 at 11:57

Let list1 <- letters[1:10] (to show how it works when the length of the vector is not a multiple of 3). Then, try this:

list1 to list2

# method 1 (seems to be the fastest so far, 
# my suspicions about loop being slower were wrong)
list2 <- sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = "")
# alternatively as @flodel mentions
list2 <- tapply(list1, (seq_along(list1)-1) %/% 3, paste, collapse = "")

The tapply version runs at a similar time as sapply+split (benchmarking not shown).
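To see why method 1 works, it can help to look at the grouping index on its own. The following is only a small sketch; `group_size` and `idx` are illustrative names that do not appear in the answer.

```r
list1 <- letters[1:10]
group_size <- 3  # illustrative name; the answer hard-codes 3

# Zero-based position, integer-divided by the group size, gives the
# same index for every run of three consecutive elements.
idx <- (seq_along(list1) - 1) %/% group_size
idx
# [1] 0 0 0 1 1 1 2 2 2 3

# split() cuts list1 wherever idx changes; paste() collapses each piece.
list2 <- unname(sapply(split(list1, idx), paste, collapse = ""))
list2
# [1] "abc" "def" "ghi" "j"
```

Because the index is computed from positions only, the leftover element ("j" here) simply ends up in a shorter final group.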

Going one step further, using @JoshOBrien's idea in this post

# method 2
pattern <- "(?<=[[:alnum:]]{3})(?=[[:alnum:]])"
strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]]
# [1] "abc" "def" "ghi" "j"

If you want the leftover elements appended to the last full group (here, the j onto ghi), use:

pattern <- "(?<=[[:alnum:]]{3})(?=[[:alnum:]]{3})"
strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]]
# [1] "abc"  "def"  "ghij"

list2 to list1

unlist(strsplit(list2, ""), use.names=FALSE)
#  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

Here's a benchmark of method 1, method 2 and eddi's solution (note that arun() below is method 2 and arun2() is method 1):

data:

list1 <- sample(letters, 1e5, replace=TRUE)

functions:

arun <- function() {
    pattern <- "(?<=[[:alnum:]]{3})(?=[[:alnum:]])"
    strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]]
}

arun2 <- function() {
    unname(sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = ""))
}

eddi <- function() {
    substring(paste(list1, collapse = ""),
          seq(1, length(list1), 3),
          pmin(seq(3, length(list1)+2, 3), length(list1)))    
}

benchmarking:

require(microbenchmark)
microbenchmark(t1 <- arun(), t2 <- eddi(), t3 <- arun2(), times=10)
identical(t1, t2) # TRUE
identical(t1, t3) # TRUE

# Unit: milliseconds
#           expr       min        lq    median        uq       max neval
#   t1 <- arun() 3352.9867 3400.8627 3512.7037 3585.6499 3635.2182    10
#   t2 <- eddi() 3302.0925 3318.4184 3356.2109 3409.9728 3487.7220    10
#  t3 <- arun2()  474.9235  494.7407  539.4406  641.2605  907.9072    10

eddi · Answer 2 · 2013-05-07T15:09:06.217

Here's another version that's faster than both of @Arun's methods. In my opinion it comes at the expense of readability compared to his method 1, which is unfortunately much slower than his method 2 or this one.

Edit: after some benchmarking, it seems that Arun's first method, while not doing so well at small to medium sizes, actually scales much better and wins at larger sizes.

Another edit: the Grothendieck solution is another one that doesn't do well at small sizes, but scales even better than Arun's first method:

substring(paste(list1, collapse = ""),
          seq(1, length(list1), 3),
          pmin(seq(3, length(list1)+2, 3), length(list1)))
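The start and stop positions passed to substring() are easier to follow when printed separately. A minimal sketch on a 10-element input; `starts` and `stops` are illustrative names, not part of the original one-liner.

```r
list1 <- letters[1:10]
n <- length(list1)

starts <- seq(1, n, 3)               # first character of each group
stops  <- pmin(seq(3, n + 2, 3), n)  # last character of each group, capped at n
starts
# [1]  1  4  7 10
stops
# [1]  3  6  9 10

chunks <- substring(paste(list1, collapse = ""), starts, stops)
chunks
# [1] "abc" "def" "ghi" "j"
```

Running the stop sequence up to n + 2 ensures there is one stop for every start, and pmin() then pulls the final stop back to n when the last group is short.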

benchmark

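library(microbenchmark)
# assumed: `pattern` is the method 2 regex from @Arun's answer
pattern <- "(?<=[[:alnum:]]{3})(?=[[:alnum:]])"
list1 = sample(letters, 10000, replace = TRUE)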
microbenchmark(eddi=substring(paste(list1, collapse = ""),seq(1, length(list1), 3),pmin(seq(3, length(list1)+2, 3), length(list1))),
               Arun1=sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = ""),
               Arun2=strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]],
               Grothendieck=apply(matrix(c(list1, rep("", (3 - length(list1) %% 3) %% 3)), 3), 2, paste, collapse = ""),
               times = 100)
#Unit: milliseconds
#         expr       min       lq   median       uq      max neval
#         eddi  8.804764 10.17807 11.33133 11.58993 12.69495   100
#        Arun1 51.287326 61.74937 65.51151 67.15510 73.98805   100
#        Arun2 12.305300 13.52000 14.65123 15.00816 17.20151   100
# Grothendieck 25.043657 29.15488 29.87843 31.02118 45.85889   100

benchmarks continued

This is somewhat interesting: at 1e5, Arun1 actually edges out eddi and Arun2 slightly, while Grothendieck is faster still:

list1 = sample(letters, 1e5, replace = T)
microbenchmark(eddi=substring(paste(list1, collapse = ""),seq(1, length(list1), 3),pmin(seq(3, length(list1)+2, 3), length(list1))),
               Arun1=sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = ""),
               Arun2=strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]],
               Grothendieck=apply(matrix(c(list1, rep("", (3 - length(list1) %% 3) %% 3)), 3), 2, paste, collapse = ""),
               times = 30)
#Unit: milliseconds
#         expr      min       lq   median       uq      max neval
#         eddi 417.5631 452.6823 480.4397 528.6187 681.0612    30
#        Arun1 363.0641 401.6795 420.8844 475.2225 587.3645    30
#        Arun2 426.9462 466.5132 506.1106 552.9374 778.7303    30
# Grothendieck 178.2272 206.0161 216.2643 246.3848 280.7988    30

the large N bench

list1 = sample(letters, 1e6, replace = T)
microbenchmark(Arun1=sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = ""),
               Grothendieck=apply(matrix(c(list1, rep("", (3 - length(list1) %% 3) %% 3)), 3), 2, paste, collapse = ""), times = 10)
#Unit: seconds
#         expr      min       lq   median       uq      max neval
#        Arun1 5.829132 7.654288 8.582664 8.779793 9.168519    10
# Grothendieck 3.196645 3.416421 3.533622 3.725822 3.951419    10

`Arun1` seems to be the fastest on my case... when I try with 1e5. — Arun, May 06 '13 at 22:49
I've updated my post with benchmarks as well. For your data, all of them take approximately 37ms. Could you try increasing your data to 1e5 and trying the same benchmark? And please run at least 3 times! :) From my benchmarking it seems that the regexp solution `method2` (the slowest) matches closely to that of yours. — Arun, May 06 '13 at 22:51
Arun1? really? that's pretty interesting if it's non-linear in that fashion — eddi, May 06 '13 at 22:57
The only one that seems to scale really well is the `sapply` solution in my benchmarking. Which could be replaced by `tapply` (sapply+split) as @flodel wrote. — Arun, May 06 '13 at 22:59
@Arun, I'm seeing smth similar - that's great, I'm always happy if the more elegant solution wins in benchmarks :) — eddi, May 06 '13 at 23:02
Your solution still shows yours being faster and it seems like a close call. But on my laptop, it's very different (as you see from my benchmarking). — Arun, May 06 '13 at 23:02
@Arun, interesting, I do have a lot of RAM, but the machine is not super fast — eddi, May 06 '13 at 23:05
this is definitely interesting! Maybe someone else can also benchmark. I'm running on MBPm R 3.0.0, OS X Mountain Lion v10.8.2, 8GB RAM. — Arun, May 06 '13 at 23:10
I wonder if it has to do with the difference in the computation time of `pmin` on both our systems. In any case, I'll try to catch you later to check on this sometime... :) — Arun, May 06 '13 at 23:23
@Arun - I just redid the benches and got ~2x faster times on everything :) (the relative positions didn't change though) — eddi, May 07 '13 at 15:01

G. Grothendieck · Answer 3 · 2013-05-07T15:53:37.857


1) Try this:

apply(matrix(list1, 3), 2, paste, collapse = "")
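This relies on matrix() filling column by column, so each column holds one group of three. A short sketch on a 9-element input (the length is chosen here so that it divides evenly; `m` and `res` are illustrative names):

```r
list1 <- letters[1:9]

# 3 rows, filled column by column: column 1 is a, b, c; column 2 is d, e, f; ...
m <- matrix(list1, 3)
m[, 1]
# [1] "a" "b" "c"

# Collapse each column (MARGIN = 2) into a single string.
res <- apply(m, 2, paste, collapse = "")
res
# [1] "abc" "def" "ghi"
```

As far as I know, when the length is not a multiple of 3, matrix() warns and recycles the input to fill the last column, which is why this first form is only suitable for lengths that divide evenly.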

2) and a variant that works even if the length of list1 is not a multiple of 3. Here 3 * ceiling(n/3) is the length of m and we subtract n from that to get the number of positions still to fill:

n <- length(list1)
k <- 3 * ceiling(n / 3) - n
m <- matrix(c(list1, rep("", k)), 3)
apply(m, 2, paste, collapse = "")
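The padding count `k` can be written in more than one way. The sketch below compares the formula used here with the modulo form suggested in the comments on this answer; `pad_ceiling` and `pad_modulo` are hypothetical helper names introduced only for the comparison.

```r
# Number of "" cells needed to pad a length-n vector to a multiple of 3.
pad_ceiling <- function(n) 3 * ceiling(n / 3) - n  # formula used in this answer
pad_modulo  <- function(n) (3 - n %% 3) %% 3       # corrected formula from the comments

n <- 0:12
pad_ceiling(n)
# [1] 0 2 1 0 2 1 0 2 1 0 2 1 0
all(pad_ceiling(n) == pad_modulo(n))
# [1] TRUE
```

The outer `%% 3` in the modulo form matters: the uncorrected `3 - n %% 3` gives 3 when n is already a multiple of 3, which would add a whole column of empty strings and therefore an extra "" in the result.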

3) And here is a different solution which, like the second one, also works if n is not a multiple of 3:

n <- length(list1)
tapply(list1, c(gl(n, 3, n)), paste, collapse = "")
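A caveat on the `c(gl(...))` idiom: it depends on c() reducing a factor to its integer codes. To my knowledge, since R 4.1.0 c() has a factor method that returns a factor with all n levels kept, in which case tapply() would also return NA entries for the unused levels. A version-independent sketch, using as.integer() instead (`g` and `res` are illustrative names):

```r
list1 <- letters[1:10]
n <- length(list1)

# gl(n, 3, n) labels positions 1,1,1,2,2,2,...; as.integer() extracts the
# codes explicitly rather than relying on how c() treats a factor.
g <- as.integer(gl(n, 3, n))
g
# [1] 1 1 1 2 2 2 3 3 3 4

# as.vector() drops the names and dim that tapply() attaches.
res <- as.vector(tapply(list1, g, paste, collapse = ""))
res
# [1] "abc" "def" "ghi" "j"
```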

UPDATE: Added variant that handles length not a multiple of 3 and a different solution as well.

edited May 07 '13 at 15:53

answered May 07 '13 at 00:23


Here's an extension of your first solution to any `n`: `apply(matrix(c(list1, rep("", 3 - length(list1) %% 3)), 3), 2, paste, collapse = "")`. You should benchmark this - it looked really fast in some basic tests. – eddi May 07 '13 at 14:45
slight correction: `apply(matrix(c(list1, rep("", (3 - length(list1) %% 3) %% 3)), 3), 2, paste, collapse = "")` (maybe someone can come up with a more compact formula, but the idea is to pad initial list with appropriate number of empty cells) – eddi May 07 '13 at 14:52
I added some benches to my post - this scales really well – eddi May 07 '13 at 15:00
Have added a variant of first solution that does not require n to be a multiple of 3. – G. Grothendieck May 07 '13 at 15:26
:) Not sure why you don't want to use my suggestion for calculating `k`, as your current formula is longer and is less computationally efficient. – eddi May 07 '13 at 16:01
Thank you for your suggestion but the formula for `k` I used is only 5 characters longer, it's more direct, and the efficiency of calculating a simple scalar is surely not material. – G. Grothendieck May 07 '13 at 16:16