I have an array of arrays containing more than 10,000 pairs of Float64 values, something like this:

v = [[rand(),rand()], ..., [rand(),rand()]]

I want to get a two-column matrix from it. It is possible to iterate over all the pairs with a loop; this looks cumbersome, but gives the result in a fraction of a second:

x = Vector{Float64}()
y = Vector{Float64}()
for i = 1:length(v)
    push!(x, v[i][1])
    push!(y, v[i][2])
end
w = hcat(x,y)

The solution with permutedims(reshape(hcat(v...), (length(v[1]), length(v)))), which I found in an answer to another question, looks more elegant but hangs Julia completely, so the session has to be restarted. Perhaps it was optimal six years ago, but it no longer works for large arrays. Is there a solution that is both compact and fast?

Anton Degterev
  • I don't understand why your loop example creates two vectors, `x` and `y`. Why not just create a matrix and then write the values straight into that? Seems much more direct? – DNF May 17 '21 at 21:30

3 Answers

I hope this is short and efficient enough for you:

getindex.(v, [1 2])

and if you want something simpler to digest:

[v[i][j] for i in 1:length(v), j in 1:2]

Also the hcat solution could be written as:

permutedims(reshape(reduce(hcat, v), (length(v[1]), length(v))));

and it should not hang your Julia (please confirm - it works for me).
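
As a minimal sketch of the difference (the three-pair `v` below is illustrative data, not taken from the question): `hcat(v...)` splats every inner vector into a separate argument, whereas `reduce(hcat, v)` receives the whole collection as a single argument. For a vector of equal-length vectors, the result of `reduce(hcat, v)` already has size `(length(v[1]), length(v))`, so the `reshape` step should leave it unchanged in this case.

```julia
# Illustrative data: three pairs instead of 10,000+
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cols = reduce(hcat, v)          # each inner vector becomes a column
@assert size(cols) == (2, 3)

# Reshaping to (length(v[1]), length(v)) leaves the shape unchanged here
@assert reshape(cols, (length(v[1]), length(v))) == cols

w = permutedims(cols)           # swap dimensions to get an N×2 Matrix
@assert w == [1.0 2.0; 3.0 4.0; 5.0 6.0]
```

Under that assumption the shorter `permutedims(reduce(hcat, v))` gives the same matrix. On Julia 1.9 or later, `stack(v; dims=1)` should also produce it directly, but check this against your Julia version.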

@Antonello: to understand why this works consider a simpler example:

julia> string.(["a", "b", "c"], [1 2])
3×2 Matrix{String}:
 "a1"  "a2"
 "b1"  "b2"
 "c1"  "c2"

I am broadcasting a column Vector ["a", "b", "c"] against a 1-row Matrix [1 2]. The point is that [1 2] is a Matrix, so broadcasting expands the result along both rows (forced by the Vector) and columns (forced by the Matrix). For this expansion to happen it is crucial that the [1 2] matrix has exactly one row. Is this clearer now?
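
The same shape rule can be checked directly on `getindex`. This is only a sketch on illustrative data (the three-pair `v` and the name `idx` are not from the question); it also shows what happens when `[1, 2]` is written with a comma, which makes it a Vector instead of a 1-row Matrix.

```julia
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # illustrative data

idx = [1 2]                    # spaces, no comma: a 1×2 Matrix{Int}
@assert size(idx) == (1, 2)

w = getindex.(v, idx)          # shape (3,) broadcast against (1, 2) gives (3, 2)
@assert size(w) == (3, 2)

# Each entry is getindex(v[i], idx[1, j]), i.e. v[i][j]
@assert w == [v[i][j] for i in 1:3, j in 1:2]

# With a comma, [1, 2] is a 2-element Vector; its length does not match
# length(v) == 3, so broadcasting throws a DimensionMismatch
err = try
    getindex.(v, [1, 2])
    nothing
catch e
    e
end
@assert err isa DimensionMismatch
```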

Bogumił Kamiński

Your own example is pretty close to a good solution, but does some unnecessary work, by creating two distinct vectors, and repeatedly using push!. This solution is similar, but simpler. It is not as terse as the broadcasted getindex by @BogumilKaminski, but is faster:

function mat(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for i in eachindex(v)
        M[i, 1] = v[i][1]
        M[i, 2] = v[i][2]
    end
    return M
end

You can simplify it a bit further, without losing performance, like this:

function mat_simpler(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for (i, x) in pairs(v)
        M[i, 1], M[i, 2] = x
    end
    return M
end
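
If the inner vectors hold more than two values, the same pattern could be extended along these lines. `mat_general` is a hypothetical name, and the sketch assumes that every inner vector has the same length as the first one and uses 1-based indexing:

```julia
# Hypothetical generalization of mat: the column count is read from the
# first inner vector; all inner vectors are assumed to share that length.
function mat_general(v)
    n = length(first(v))
    M = Matrix{eltype(eltype(v))}(undef, length(v), n)
    for (i, x) in pairs(v)
        for j in 1:n
            M[i, j] = x[j]
        end
    end
    return M
end

v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # illustrative data
@assert mat_general(v) == [1.0 2.0; 3.0 4.0; 5.0 6.0]
```
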
DNF

A benchmark of the various solutions posted so far...

using BenchmarkTools
# Creating the vector
v = [[i, i+0.1] for i in 0.1:0.2:2000]

M1 = @btime vcat([[e[1] e[2]] for e in $v]...)
M2 = @btime getindex.($v, [1 2])
# Note: the v inside v[i][j] is not interpolated with $, so M3 reads the global v
M3 = @btime [v[i][j] for i in 1:length($v), j in 1:2]
M4 = @btime permutedims(reshape(reduce(hcat, $v), (length($v[1]), length($v))))
M5 = @btime permutedims(reshape(hcat($v...), (length($v[1]), length($v))))

function original(v)
    x = Vector{Float64}()
    y = Vector{Float64}()
    for i = 1:length(v)
        push!(x, v[i][1])
        push!(y, v[i][2])
    end
    return hcat(x,y)
end
function mat(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for i in eachindex(v)
        M[i, 1] = v[i][1]
        M[i, 2] = v[i][2]
    end
    return M
end
function mat_simpler(v)
    M = Matrix{eltype(eltype(v))}(undef, length(v), 2)
    for (i, x) in pairs(v)
        M[i, 1], M[i, 2] = x
    end
    return M
end

M6 = @btime original($v)
M7 = @btime mat($v)
M8 = @btime mat_simpler($v)

M1 == M2 == M3 == M4 == M5 == M6 == M7 == M8 # true

Output:

  1.126 ms (10010 allocations: 1.53 MiB)       # M1
  54.161 μs (3 allocations: 156.42 KiB)        # M2
  809.000 μs (38983 allocations: 765.50 KiB)   # M3
  98.935 μs (4 allocations: 312.66 KiB)        # M4
  244.696 μs (10 allocations: 469.23 KiB)      # M5
  219.907 μs (30 allocations: 669.61 KiB)      # M6
  34.311 μs (2 allocations: 156.33 KiB)        # M7
  34.395 μs (2 allocations: 156.33 KiB)        # M8

Note that the dollar sign in the benchmarked code interpolates the vector into the expression, so that @btime treats it as a local variable rather than as a global one.

Antonello