
I have some code that calculates the distance from every Cartesian coordinate in one matrix to every coordinate in another. For each coordinate in the first matrix, the minimum distance is returned along with the index positions of the coordinates that produced it.

using Distributed, SharedArrays

function MED3D(m1, m2)
    n1::Int = size(m1,1)
    # one result row per point in m1: [minimum distance, row index in m1, row index in m2]
    Dist = SharedArray{Float64}((n1,3))
    @sync @distributed for k in 1:n1
        Dist[k,:] = MD3D(m1[k,:], m2, k)
    end
    return Dist
end

# Brute-force scan: distance from the single point v1 to every row of m2,
# keeping the minimum and the row index j that produced it.
@everywhere function MD3D(v1, m2, k)
    dsum::Float64 = Inf
    dtemp::Float64 = Inf
    i = 0
    for j in 1:size(m2,1)
        @inbounds dtemp = sqrt((v1[1] - m2[j,1]) * (v1[1] - m2[j,1]) + (v1[2] - m2[j,2]) * (v1[2] - m2[j,2]) + (v1[3] - m2[j,3]) * (v1[3] - m2[j,3]))
        if dtemp < dsum
            dsum = dtemp
            i = j
        end
    end
    return [dsum, k, i]
end

m1 = rand(10,3)
m2 = rand(15,3)
results = MED3D(m1,m2)

While this works reasonably well for smaller 3D point clouds, I am looking to increase performance on large point clouds by moving the analysis to the GPU. However, the more typical ways of doing matrix operations in Julia don't seem applicable, since I have to return the index positions as well as the minimum distance. I've tried several different ways to adapt CuArrays to this task, but so far every attempt that avoids explicit for loops has failed. Also, many ways of implementing it appear exceptionally inefficient because they store the full distance matrix in memory, which quickly exceeds 128 GB of RAM for my particular data set.

Could someone please help me with how to properly implement this in Julia so it runs on a GPU? Is CuArrays even the correct approach, or is it too high a level of abstraction given that I am returning indices in addition to a distance? I've tried to calculate the L2 norm using product and dot, but it's not exactly giving me what I require.
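For concreteness, the product/dot formulation I mean is essentially the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. A rough CPU sketch of that idea is below (pairwise_mindist is just an illustrative name); even when it works, it materializes the full n1 x n2 distance matrix, which is exactly the memory problem:

using LinearAlgebra

# Illustrative sketch only: builds the full n1 x n2 distance matrix in memory.
function pairwise_mindist(m1, m2)
    s1 = sum(abs2, m1, dims=2)           # n1 x 1 squared row norms
    s2 = sum(abs2, m2, dims=2)           # n2 x 1 squared row norms
    D2 = s1 .+ s2' .- 2 .* (m1 * m2')    # n1 x n2 squared distances
    D2 .= max.(D2, 0.0)                  # clamp tiny negatives from rounding
    vals, idxs = findmin(sqrt.(D2), dims=2)
    return vals, idxs                    # minimum distance and CartesianIndex per row of m1
end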

UPDATE:

Here is my failed attempt to GPUify the inner loop using broadcasting.

using CuArrays
function difff(m1,m2)
    n1 = size(m1,1)
    Dist = Array{Float64}(undef, n1,3)
    m2 = CuArray(m2)
    m1 = CuArray(m1)
    for z in 1:size(m1)
        v1 = transpose(m1[z,:])
        i = 0
        dsum::Float64 = Inf
        mi = v1 .- m2
        mi = mi .* mi
        mi = sum(mi, dims=2)
        mi = mi .^ 0.5
        mi = findmin(mi)
        i = mi[2][1]
        dsum = mi[1]
        @inbounds Dist[z,:] = [dsum,z,i]
    end
end

UPDATE:

Failed attempt #2. I tried calculating only the minimum distances and forgetting about the indices. This isn't ideal for my application, but I can live with it. However, this only works correctly if the first array has a single row. I tried to solve this by using mapslices, but it does not work.

using CuArrays
a = rand(1,3)
b = rand(3,3)

a = CuArray(a)
b = CuArray(b)

function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

mapslices(GK(b), a, 2)

UPDATE:

Making progress by using an outer loop, but surely there is a better way to do this?

using CuArrays
using BenchmarkTools
aa = rand(2,3)
bb = rand(5000000,3)

a = CuArray(aa)
b = CuArray(bb)

function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

function D(a,b)
    Dist = Array{Float64}(undef,size(a,1),1)
    for i in 1:size(a,1)
        Dist[i] = GK(a[i,:]',b)
    end
    return Dist
end

@benchmark test = D(a,b)
@benchmark test = D(aa,bb)

UPDATE:

Some benchmarking between my previous distributed version, a modified distributed version, the GPU version, and a serial version. EDIT: After scaling to 100 billion comparisons, the GPU version no longer outperforms my previous distributed version... Any thoughts on why this is?

using Distributed
using SharedArrays
using CuArrays
using BenchmarkTools

aa = rand(4,3)
bb = rand(500000,3)
a = CuArray(aa)
b = CuArray(bb)

function MED3D(m1, m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1
        Dist[k] = MD3D(m1[k,:]', m2)
    end
    return Dist
end

@everywhere function MD3D(v1, m2)
    dsum::Float64 = Inf
    dtemp::Float64 = Inf
    for j in 1:size(m2,1)
        @inbounds dtemp = sqrt((v1[1] - m2[j,1]) * (v1[1] - m2[j,1]) + (v1[2] - m2[j,2]) * (v1[2] - m2[j,2]) + (v1[3] - m2[j,3]) * (v1[3] - m2[j,3]))
        if dtemp < dsum
            dsum = dtemp
        end
    end
    return dsum
end

function MED3DGK(m1, m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1
        @inbounds Dist[k] = GK(m1[k,:]',m2)
    end
    return Dist
end

@everywhere function GK(m1, m2)
    reduce(min, sum((m1 .- m2) .^ 2,dims=2) .^ 0.5)
end

function D(a,b)
    Dist = Array{Float64}(undef,size(a,1),1)
    for i in 1:size(a,1)
        @inbounds Dist[i] = GK(a[i,:]',b)
    end
    return Dist
end

@benchmark test = D(a,b)
@benchmark test = D(aa,bb)
@benchmark test = MED3D(aa,bb)
@benchmark test = MED3DGK(aa,bb)

UPDATE:

Implementation using NearestNeighbors.jl with distributed processing. Any thoughts on how to make this even faster?

using NearestNeighbors

function MED3D(m1, m2)
    m2 = Matrix(m2')     # KDTree expects points as columns
    kdtree = KDTree(m2)
    n1::Int = size(m1,1)
    Dist = SharedArray{Float64}((n1,1))
    Ind = SharedArray{Float64}((n1,1))
    @sync @distributed for k in 1:n1
        Ind[k,:], Dist[k,:] = knn(kdtree, m1[k,:], 1)
    end
    return [Ind,Dist]
end
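For completeness, here is roughly how I drive it (the worker count and array sizes are purely illustrative):

using Distributed
addprocs(4)                                       # illustrative worker count
@everywhere using NearestNeighbors, SharedArrays  # workers need the packages loaded too

m1 = rand(100_000, 3)   # query points, one per row
m2 = rand(500_000, 3)   # reference points, one per row

ind, dist = MED3D(m1, m2)   # nearest index in m2 and minimum distance for each row of m1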
JJL
  • Have you looked into Distances.jl? I'm not sure if their algorithms work in distributed systems, but it would be worth checking (and opening an issue if they don't). Sorry I can't help with your actual question. – kevbonham Oct 24 '19 at 14:13
  • @kevbonham I've looked into Distances.jl. The reason I haven't used it was the lack of parallel support. Given how much data I have to analyze, using a serial approach just isn't realistic. Also, from what I can see, they have a tendency to return all of the data rather than tossing out the unneeded distances, which causes a significant increase in memory allocation for my particular case. Returning an entire distance matrix is just crazy for me. – JJL Oct 24 '19 at 15:15
  • Got it - makes sense. Good luck finding a solution - and if you do, I'm sure the folks at Distances.jl would love to hear it. – kevbonham Oct 24 '19 at 19:49
  • Have you considered building NNTrees and using knn from NearestNeighbors.jl? I think it's pretty hard to beat for perf. – Michael K. Borregaard Oct 24 '19 at 20:51
  • @michaelK.Borregaard I have considered it, but for my particular task I am using the distances to calculate precise error between models. The nearest-neighbor approach doesn't always produce the exact equivalent correspondences. It would represent an approximation rather than the true distance. – JJL Oct 24 '19 at 21:40
  • The NearestNeighbors package produces indices and exact distances - are you not conflating this with a different algorithm? – Michael K. Borregaard Oct 25 '19 at 06:20
  • If it doesn't work for you (though looking at its description I think Borregaard is right), you should consider optimizing along similar lines anyway; making use of a GPU is a nice brute-force force multiplier, but only having to make comparisons for a fraction of the points should help enormously. – Pete Oct 25 '19 at 08:10
  • I will run some tests and see if they produce the same distances. As far as I remember, using a nearest-neighbor algorithm like kNN can cause a lack of correct correspondences, since actual nearest neighbors can be split into different groups. So it creates issues near the bounds of the tree. – JJL Oct 25 '19 at 13:24
  • Do you guys know if transposing the matrix so rows are dimensions still produces the correct calculations in NN? Doing it where rows are coordinates is extremely slow with the NN package. The parallel code I have outperforms NN in that regard. – JJL Oct 25 '19 at 13:41
  • Well, I was definitely wrong! The NearestNeighbors package is producing identical results even when scaling to billions of comparisons, and about a 10,000-fold increase in speed. I've been going about this all wrong. – JJL Oct 25 '19 at 15:16
  • That's great that you managed to get such a huge speed-up! I'm wondering if there's a way to reformulate this question in a way that is useful and answerable. Having all of your updates is great documentation, but SO really depends on questions and answers. This is a bit of a journey instead (maybe it would be good to reproduce over at discourse.julialang.org?). – kevbonham Oct 28 '19 at 13:53

1 Answer


I'm not sure if it will help in your case, but when taking a slice like m1[k,:], by default Julia copies that memory (although maybe it depends on what the knn function does with that slice).

Does it improve anything if you change it to knn(kdtree, @view m1[k,:], 1)?
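That is, something roughly like this for the loop in your last update (written with the parenthesized form so the macro applies only to the slice; whether it actually helps depends on how knn uses its argument):

@sync @distributed for k in 1:n1
    # @view(...) passes a non-copying view of the row instead of allocating a fresh vector
    Ind[k,:], Dist[k,:] = knn(kdtree, @view(m1[k,:]), 1)
end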

jerlich