
The problem is pretty silly, but I am wondering if I am missing something. Let's say that there is a vector k that contains some numbers, say

> k
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

I want to transform this to a matrix

> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    0    6    7    8    9
[3,]    0    0   10   11   12
[4,]    0    0    0   13   14
[5,]    0    0    0    0   15

My first idea was to use upper.tri(), for example m[upper.tri(m, diag = TRUE)] <- k, but that will not give the matrix above (see the demonstration below).
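
For reference, here is what that attempt produces: the assignment fills the upper triangle in column-major order, so the values run down the columns instead of across the rows.

k <- 1:15
m <- matrix(0, 5, 5)
m[upper.tri(m, diag = TRUE)] <- k
m
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    2    4    7   11
#[2,]    0    3    5    8   12
#[3,]    0    0    6    9   13
#[4,]    0    0    0   10   14
#[5,]    0    0    0    0   15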

Is there a more elegant solution to this? My own attempt is below, but let's just say I am not too proud of it.


k <- 1:15
m <- matrix(0, 5, 5)

# row index for each element of k: 1 five times, 2 four times, ..., 5 once
rows <- rep(1:5, 5:1)

# run lengths of rows, i.e. how many values go into each row
cols1 <- rle(rows)$lengths

# column indices for each element of k: 1:5, then 2:5, ..., then 5:5
cols <- do.call(c, lapply(1:length(cols1), function(x) x:5))

for (i in 1:length(k)) {
  m[rows[i], cols[i]] <- k[i]
}
– Theodor

4 Answers


Here's an option using lower.tri and t to transpose the result:

k <- 1:15
m <- matrix(0, 5,5)
m[lower.tri(m, diag = TRUE)] <- k
m <- t(m)
m 
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    2    3    4    5
#[2,]    0    6    7    8    9
#[3,]    0    0   10   11   12
#[4,]    0    0    0   13   14
#[5,]    0    0    0    0   15
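
To see why the transpose is needed, look at m just before t() is applied: the assignment fills the lower triangle in column-major order, which is exactly the order in which the rows of the final matrix should be read.

m <- matrix(0, 5,5)
m[lower.tri(m, diag = TRUE)] <- 1:15
m
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    0    0    0    0
#[2,]    2    6    0    0    0
#[3,]    3    7   10    0    0
#[4,]    4    8   11   13    0
#[5,]    5    9   12   14   15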

Microbenchmark

Since there was some confusion with Joseph's benchmark, here's another one. I tested the three solutions for matrices of size 10×10, 100×100, 1000×1000, and 10000×10000.

Results:

(Figure: violin plots of the benchmark timings, one panel per matrix size; see the plotting code below.)

Apparently, the performance depends heavily on the size of the matrix. For large matrices, Joseph's answer is fastest, while for smaller matrices mine is. Note that this doesn't take memory efficiency into account.

Reproducible benchmark:

Joseph <- function(k, n) {
  y <- 1L
  t <- rep(0L, n)  # vapply template; the call t(...) below still finds base::t
  # j[x] is the position in k where row x's values start
  j <- c(y, sapply(1:(n-1L), function(x) y <<- y+(n+1L)-x))
  # each row: (x-1) zeros followed by the next slice of k; transpose at the end
  t(vapply(1:n, function(x) c(rep(0L,x-1L),k[j[x]:(j[x]+n-x)]), t, USE.NAMES = FALSE))
}

Frank <- function(k, n) {
  m = matrix(0L, n, n)
  m[ which(lower.tri(m, diag=TRUE), arr.ind=TRUE)[, 2:1] ] = k
  m
}

docendo <- function(k,n) {
  m <- matrix(0L, n, n)
  m[lower.tri(m, diag = TRUE)] <- k
  t(m)
}

library(microbenchmark)
library(data.table)
library(ggplot2)
n <- c(10L, 100L, 1000L, 10000L)
k <- lapply(n, function(x) seq.int((x^2 + x)/2))  # n*(n+1)/2 values for each n

b <- lapply(seq_along(n), function(i) {
  bm <- microbenchmark(Joseph(k[[i]], n[i]), Frank(k[[i]], n[i]), docendo(k[[i]], n[i]), times = 10L)
  bm$n <- n[i]
  bm
})

b1 <- rbindlist(b)

ggplot(b1, aes(expr, time)) +
  geom_violin() +
  facet_wrap(~ n, scales = "free_y") +
  ggtitle("Benchmark for n = c(10L, 100L, 1000L, 10000L)")

Check equality of results:

all.equal(Joseph(k[[1]], n[1]), Frank(k[[1]], n[1]))
#[1] TRUE
all.equal(Joseph(k[[1]], n[1]), docendo(k[[1]], n[1]))
#[1] TRUE

Note: I didn't include George's approach in the comparison since, judging by Joseph's results, it is a lot slower. So all approaches compared in my benchmark use only base R.

– talat
  • I knew there must be a way to make it with upper.tri or lower.tri. Thanks! – Theodor Jun 03 '16 at 13:45
  • Thanks for the thorough benchmarks. The graphs are a very nice touch as well! The solutions you and @Frank provided are much more intuitive (cleaner IMO) and probably more useful for the majority of cases. – Joseph Wood Jun 03 '16 at 19:18
  • @JosephWood, it was an interesting question and I was honestly a bit surprised that your solution performs so well for large matrices. Hopefully, you'll get some more upvotes. – talat Jun 03 '16 at 19:22
  • @docendo, I'm not too worried about the votes. I had a lot of fun coming up with the solution and yes, I too was very surprised it was so fast. After the first benchmark on a really large example, I was sure I had accidentally tested a smaller matrix, as I was using different indices for the k's (i.e. `k1, k2`, etc.). Base R never ceases to amaze me. – Joseph Wood Jun 03 '16 at 19:35

A variation on @docendodiscimus' answer: instead of transposing, you can swap the row and column indices, which you get by wrapping lower.tri in which:

n = 5
m = matrix(0, n, n)

m[ which(lower.tri(m, diag=TRUE), arr.ind=TRUE)[, 2:1] ] = seq(sum(seq(n)))
m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    0    6    7    8    9
[3,]    0    0   10   11   12
[4,]    0    0    0   13   14
[5,]    0    0    0    0   15

To understand how it works, look at the left-hand side in steps; the intermediate results are shown after this list:

  • lower.tri(m, diag=TRUE)
  • which(lower.tri(m, diag=TRUE), arr.ind=TRUE)
  • which(lower.tri(m, diag=TRUE), arr.ind=TRUE)[, 2:1]
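
For n = 5, the intermediate objects look like this (only the first rows of the index matrix are shown):

m <- matrix(0, n, n)
idx <- which(lower.tri(m, diag = TRUE), arr.ind = TRUE)
head(idx, 6)   # (row, col) pairs of the lower triangle, in column-major order
#      row col
# [1,]   1   1
# [2,]   2   1
# [3,]   3   1
# [4,]   4   1
# [5,]   5   1
# [6,]   2   2
idx[, 2:1]     # columns swapped: the same pairs now index the upper triangle

Indexing a matrix with a two-column matrix selects (row, column) pairs, so the assignment writes the values into those upper-triangle positions, row by row.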

I guess transposing might be costly if the matrix is large, which is why I'd consider this option. Note: Joseph Wood's answer suggests I'm wrong about that, since the transposing approach is faster in his benchmark.


(Thanks to @JosephWood:) Instead of enumerating and summing with sum(seq(n)), you can compute the same quantity, n(n+1)/2, directly as (n^2 - n)/2 + n.

– Frank
library(miscTools)
k <- 1:15
triang(k, 5)
– gd047

Here is a really fast base R solution:

Update

I've modified the code slightly so that vapply is called only once, instead of the sapply/vapply combo I had before (I also dropped USE.NAMES = FALSE, as it doesn't appear to make any difference). Although this is a bit cleaner, it didn't change the timings significantly on my machine (I reran docendo's benchmarks with plots and the results look nearly the same).

Triangle1 <- function(k, n) {
    y <- -n          # y tracks the start index of each row's slice of k
    r <- rep(0L, n)  # vapply template: an integer vector of length n
    t(vapply(1:n, function(x) {
        y <<- y + n + 2L - x              # advance to where row x's values begin
        c(rep(0L, x-1L), k[y:(y+n-x)])    # (x-1) zeros, then the n-x+1 values for row x
    }, r))
}
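
As a quick sanity check, Triangle1 reproduces the example matrix from the question:

Triangle1(1:15, 5)
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    2    3    4    5
#[2,]    0    6    7    8    9
#[3,]    0    0   10   11   12
#[4,]    0    0    0   13   14
#[5,]    0    0    0    0   15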

Here are some timings:

Triangle2 <- function(k,n) {
    m <- matrix(0, n,n)
    m[lower.tri(m, diag = TRUE)] <- k
    t(m)
}

Triangle3 <- function(k, n) {
    m = matrix(0, n, n)
    m[ which(lower.tri(m, diag=TRUE), arr.ind=TRUE)[, 2:1] ] = k   ## seq(sum(seq(n)))  for benchmarking
    m
}

k2 <- 1:50005000
n2 <- 10^4

system.time(t1 <- Triangle1(k2,n2))
user  system elapsed           ## previously   user  system elapsed
2.29    0.08    2.41           ##              2.37    0.13    2.52

system.time(t2 <- Triangle2(k2,n2))
user  system elapsed 
5.40    0.91    6.30

system.time(t3 <- Triangle3(k2,n2))
user  system elapsed 
7.70    1.03    8.77 

system.time(t4 <- triang(k2,n2))   ## triang from the miscTools package
user  system elapsed 
433.45    0.20  434.88

One thing that is a bit puzzling to me is that the object produced by Triangle1 is half the size of the objects produced by the other solutions.

object.size(t1)
400000200 bytes

object.size(t2)   ## it's the same for t3 and t4
800000200 bytes

When I do some checks, it only gets more confusing.

all(sapply(1:ncol(t1), function(x) all(t1[,x]==t2[,x])))
[1] TRUE

class(t1)
[1] "matrix"
class(t2)
[1] "matrix"

attributes(t1)
$dim
[1] 10000 10000
attributes(t2)
$dim
[1] 10000 10000

## not sure what's going on here
identical(t1,t2)
[1] FALSE

identical(t2,t3)
[1] TRUE

As @Frank pointed out in the comments, t1 is an integer matrix whereas the others are numeric. I should have known this, as str, one of the most important R functions, would have told me this information from the outset.

str(t1)
int [1:10000, 1:10000] 1 0 0 0 0 0 0 0 0 0 ...
str(t2)
num [1:10000, 1:10000] 1 0 0 0 0 0 0 0 0 0 ...
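
Incidentally, the size difference disappears if the zero matrix is created as an integer matrix from the start, as the docendo version in the benchmark above already does. A minimal sketch (Triangle2int is just an illustrative name):

Triangle2int <- function(k, n) {
    m <- matrix(0L, n, n)   # integer zeros instead of numeric 0
    m[lower.tri(m, diag = TRUE)] <- k
    t(m)
}
str(Triangle2int(1:15, 5))
# int [1:5, 1:5] 1 0 0 0 0 2 6 0 0 0 ...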
– Joseph Wood
  • `Triangle3` computes k while the others are given it for free, right? I don't know how long that takes, but it makes the benchmark look asymmetrical. – Frank Jun 03 '16 at 15:33
  • @Frank, sorry about that. The times now reflect `k` being passed to `Triangle3`. I will say that `seq(sum(seq(n)))` is extremely fast. It registers `0.06` on my machine for `n = 10^4`. – Joseph Wood Jun 03 '16 at 15:45
  • `t1` is smaller and not identical because it is an integer matrix. – Frank Jun 03 '16 at 16:57