What is the most elegant way of converting list1 to list2, and also list2 to list1?
list1<- c('a','b','c','d','e','f','g','h','i')
list2<- c('abc','def','ghi')
i.e., concatenate elements in groups of three.
thanks :D
Let list1 <- letters[1:10] (to show how it works when the length of the vector is not a multiple of 3). Then, try this:
# method 1 (seems to be the fastest so far,
# my suspicions about loop being slower were wrong)
list2 <- sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = "")
# alternatively as @flodel mentions
list2 <- tapply(list1, (seq_along(list1)-1) %/% 3, paste, collapse = "")
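To see why the grouping works, it can help to print the index that split() and tapply() receive. This is only a small sketch using the same list1 <- letters[1:10]; the names grp and chunks are introduced here for illustration.

```r
list1 <- letters[1:10]

# Zero-based position, integer-divided by 3, gives one group id per chunk
grp <- (seq_along(list1) - 1) %/% 3
grp
# [1] 0 0 0 1 1 1 2 2 2 3

# split() cuts list1 by group id; paste() then collapses each piece
chunks <- split(list1, grp)
list2 <- unname(sapply(chunks, paste, collapse = ""))
list2
# [1] "abc" "def" "ghi" "j"
```

Changing the 3 to another chunk size should be all that is needed to group by a different width.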
The tapply version runs at a similar time as sapply+split (benchmarking not shown).
Going one step further, using @JoshOBrien's idea in this post
# method 2
pattern <- "(?<=[[:alnum:]]{3})(?=[[:alnum:]])"
strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]]
# [1] "abc" "def" "ghi" "j"
And if you want the last part concatenated to the last-but-one (here the j to ghi), then do:
pattern <- "(?<=[[:alnum:]]{3})(?=[[:alnum:]]{3})"
strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]]
# [1] "abc" "def" "ghij"
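If the lookaround pattern feels hard to read, a different regex route (not the one used above) is to match the chunks directly instead of splitting between them. This is a sketch using base R's gregexpr() and regmatches(); the names x and chunks are just for illustration.

```r
list1 <- letters[1:10]
x <- paste(list1, collapse = "")

# ".{1,3}" greedily matches up to three characters at a time,
# so the final match simply holds whatever is left over
chunks <- regmatches(x, gregexpr(".{1,3}", x))[[1]]
chunks
# [1] "abc" "def" "ghi" "j"
```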
To go from list2 back to list1, split each string into single characters:
unlist(strsplit(list2, ""), use.names=FALSE)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
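A quick round-trip check can confirm that the reverse conversion recovers the original vector. This is a sketch, assuming list2 holds the chunks produced from letters[1:10]; pieces and back are illustrative names.

```r
list1 <- letters[1:10]
list2 <- c("abc", "def", "ghi", "j")

# Splitting on the empty string yields one list element per input string
pieces <- strsplit(list2, "")
lengths(pieces)
# [1] 3 3 3 1

# Flattening the list restores the original single-character vector
back <- unlist(pieces, use.names = FALSE)
identical(back, list1)
# [1] TRUE
```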
Here's a benchmark of method 1 (arun2 below), method 2 (arun below) and eddi's:
list1 <- sample(letters, 1e5, replace=TRUE)
arun <- function() {
  pattern <- "(?<=[[:alnum:]]{3})(?=[[:alnum:]])"
  strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]]
}
arun2 <- function() {
  unname(sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = ""))
}
eddi <- function() {
  substring(paste(list1, collapse = ""),
            seq(1, length(list1), 3),
            pmin(seq(3, length(list1)+2, 3), length(list1)))
}
library(microbenchmark)
microbenchmark(t1 <- arun(), t2 <- eddi(), t3 <- arun2(), times=10)
identical(t1, t2) # TRUE
identical(t1, t3) # TRUE
# Unit: milliseconds
# expr min lq median uq max neval
# t1 <- arun() 3352.9867 3400.8627 3512.7037 3585.6499 3635.2182 10
# t2 <- eddi() 3302.0925 3318.4184 3356.2109 3409.9728 3487.7220 10
# t3 <- arun2() 474.9235 494.7407 539.4406 641.2605 907.9072 10
Here's another version that's faster than both of @Arun's methods on small inputs (imo at the expense of readability compared with his method 1, which is unfortunately much slower than his method 2 or this).
[Edit: after some benchmarking, it seems that Arun's first method, while not doing so well at small to medium sizes, actually scales much better and wins at larger sizes.]
[Another edit: the Grothendieck solution also doesn't do well at small sizes, but scales even better than Arun's first method.]
Here it is:
substring(paste(list1, collapse = ""),
seq(1, length(list1), 3),
pmin(seq(3, length(list1)+2, 3), length(list1)))
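The two seq() calls are easier to follow when printed separately. This is a sketch on a ten-letter input; starts, stops and res are names introduced here for illustration.

```r
list1 <- letters[1:10]
n <- length(list1)

starts <- seq(1, n, 3)                # first character of each chunk
stops  <- pmin(seq(3, n + 2, 3), n)   # last character, clamped to n
starts
# [1]  1  4  7 10
stops
# [1]  3  6  9 10

# substring() is vectorised over the start and stop positions
res <- substring(paste(list1, collapse = ""), starts, stops)
res
# [1] "abc" "def" "ghi" "j"
```

The pmin() call is what keeps the last stop position from running past the end of the string when the length is not a multiple of 3.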
benchmark
library(microbenchmark)
list1 = sample(letters, 10000, replace = TRUE)
# regex from Arun's method 2, needed by the Arun2 entry below
pattern <- "(?<=[[:alnum:]]{3})(?=[[:alnum:]])"
microbenchmark(eddi=substring(paste(list1, collapse = ""),seq(1, length(list1), 3),pmin(seq(3, length(list1)+2, 3), length(list1))),
               Arun1=sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = ""),
               Arun2=strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]],
               Grothendieck=apply(matrix(c(list1, rep("", (3 - length(list1) %% 3) %% 3)), 3), 2, paste, collapse = ""),
               times = 100)
#Unit: milliseconds
# expr min lq median uq max neval
# eddi 8.804764 10.17807 11.33133 11.58993 12.69495 100
# Arun1 51.287326 61.74937 65.51151 67.15510 73.98805 100
# Arun2 12.305300 13.52000 14.65123 15.00816 17.20151 100
# Grothendieck 25.043657 29.15488 29.87843 31.02118 45.85889 100
benchmarks continued
This is somewhat interesting: at 1e5, Arun1 actually edges out eddi and Arun2 slightly, while Grothendieck is the fastest of the four:
list1 = sample(letters, 1e5, replace = T)
microbenchmark(eddi=substring(paste(list1, collapse = ""),seq(1, length(list1), 3),pmin(seq(3, length(list1)+2, 3), length(list1))),
Arun1=sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = ""),
Arun2=strsplit(paste(list1, collapse=""), pattern, perl=TRUE)[[1]],
Grothendieck=apply(matrix(c(list1, rep("", (3 - length(list1) %% 3) %% 3)), 3), 2, paste, collapse = ""),
times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# eddi 417.5631 452.6823 480.4397 528.6187 681.0612 30
# Arun1 363.0641 401.6795 420.8844 475.2225 587.3645 30
# Arun2 426.9462 466.5132 506.1106 552.9374 778.7303 30
# Grothendieck 178.2272 206.0161 216.2643 246.3848 280.7988 30
the large N bench
list1 = sample(letters, 1e6, replace = T)
microbenchmark(Arun1=sapply(split(list1, (seq_along(list1)-1) %/% 3), paste, collapse = ""),
               Grothendieck=apply(matrix(c(list1, rep("", (3 - length(list1) %% 3) %% 3)), 3), 2, paste, collapse = ""), times = 10)
#Unit: seconds
# expr min lq median uq max neval
# Arun1 5.829132 7.654288 8.582664 8.779793 9.168519 10
# Grothendieck 3.196645 3.416421 3.533622 3.725822 3.951419 10
1) Try this:
apply(matrix(list1, 3), 2, paste, collapse = "")
2) A variant that works even if the length of list1 is not a multiple of 3. Here 3 * ceiling(n/3) is the length of m, and we subtract n from that to get the number of positions still to fill:
n <- length(list1)
k <- 3 * ceiling(n / 3) - n
m <- matrix(c(list1, rep("", k)), 3)
apply(m, 2, paste, collapse = "")
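As a worked example of the padding arithmetic, here is the same code on a ten-letter input. This is a sketch; res is a name introduced for illustration.

```r
list1 <- letters[1:10]
n <- length(list1)

# Next multiple of 3 is 12, so two empty strings are appended
k <- 3 * ceiling(n / 3) - n
k
# [1] 2

# Filling column by column puts each group of three in its own column;
# the last column holds "j" followed by the two "" pads
m <- matrix(c(list1, rep("", k)), 3)
dim(m)
# [1] 3 4

res <- apply(m, 2, paste, collapse = "")
res
# [1] "abc" "def" "ghi" "j"
```

Because the padding is the empty string, it disappears when paste() collapses the last column.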
3) And here is a different solution which, like the second solution here, also works if n is not a multiple of 3:
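n <- length(list1)
# as.integer() turns the factor into plain group ids; since R 4.1.0,
# c() on a factor keeps it a factor, and its unused levels would give NAs
tapply(list1, as.integer(gl(n, 3, n)), paste, collapse = "")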
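Printing the grouping vector shows what tapply() is working with. This is a sketch on a ten-letter input; it uses as.integer() to turn the gl() factor into plain integer group ids, which is an assumption made here so the result does not depend on how c() treats factors in a given R version. The names grp and res are illustrative.

```r
list1 <- letters[1:10]
n <- length(list1)

# gl(n, 3, n): levels 1..n, each repeated 3 times, truncated to length n
grp <- as.integer(gl(n, 3, n))
grp
# [1] 1 1 1 2 2 2 3 3 3 4

# tapply() pastes the elements of list1 within each group id
res <- tapply(list1, grp, paste, collapse = "")
as.vector(res)
# [1] "abc" "def" "ghi" "j"
```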
UPDATE: Added variant that handles length not a multiple of 3 and a different solution as well.