It can be done in thrust. Explaining how will be rather involved, and the code is rather dense.
The key observation to start with is that the core operation can be done via a transformed reduction. The thrust transform operation is used to perform the elementwise subtraction of the vectors (individual-centroid) and squaring of each result, and the reduction sums the results together to produce the square of the euclidean distance. The starting point for this operation is thrust::reduce_by_key
, but it gets rather involved to present the data correctly to reduce_by_key
.
The final results are produced by taking the square root of each result from above, and we can use an ordinary thrust::transform
for this.
The above is a summary description of the only 2 lines of thrust code that do all the work. However, the first line has considerable complexity to it. In order to exploit parallelism, the approach I took was to virtually "lay out" the necessary vectors in sequence, to be presented to reduce_by_key
. To take a simple example, suppose we have 2 centroids and 4 individuals, and suppose our dimension is 2.
centroid 0: C00 C01
centroid 1: C10 C11
individ 0: I00 I01
individ 1: I10 I11
individ 2: I20 I21
individ 3: I30 I31
We can "lay out" the vectors like this:
C00 C01 C00 C01 C00 C01 C00 C01 C10 C11 C10 C11 C10 C11 C10 C11
I00 I01 I10 I11 I20 I21 I30 I31 I00 I01 I10 I11 I20 I21 I30 I31
To facilitate the reduce_by_key
, we will also need to generate key values to delineate the vectors:
0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
The above data "laid-out" data sets can be quite large, and we don't want to incur storage and retrieval cost, so we will generate these "on-the-fly" using thrust's collection of fancy iterators. This is where things get quite dense. With the above strategy in mind, we will use thrust::reduce_by_key to do the work. We'll create a custom functor provided to a transform_iterator
to do the subtraction (and squaring) of the I
and C
vectors, which will be zipped together for this purpose. The "lay out" of the vectors will be created on the fly using permutation iterators with additional custom index-creation functors, to help with the replicated patterns in each of I
and C
.
Therefore, working from the "inside out", the sequence of steps is as follows:
for both I
(data
) and C
(centr
) use a counting_iterator
combined with a custom indexing functor inside of a transform_iterator
to produce the indexing sequences we will need.
using the indexing sequences created in step 1 and the base I
and C
vectors, virtually "lay out" the vectors via a permutation_iterator
(one for each laid-out vector).
zip the 2 "laid out" virtual I
and C
vectors together, to create a <float, float>
tuple vector (virtual).
take the zip_iterator
from step 3, and combine with a custom distance-calculation functor ((I-C)^2
) in a transform_iterator
use another transform_iterator
, combining a counting_iterator
with a custom key-generating functor, to produce the key sequence (virtual)
pass the iterators in steps 4 and 5 to reduce_by_key
as the inputs (keys, values) to be reduced. The output vectors for reduce_by_key
are also keys and values. We don't need the keys, so we'll use a discard_iterator
to dump those. The values we will save.
The above steps are all accomplished in a single line of thrust code.
Here's a code illustrating the above:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/copy.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#define MAX_DATA 100000000
#define MAX_CENT 5000
#define TOL 0.001
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
unsigned verify(float *d1, float *d2, int len){
unsigned pass = 1;
for (int i = 0; i < len; i++)
if (fabsf(d1[i] - d2[i]) > TOL){
std::cout << "mismatch at: " << i << " val1: " << d1[i] << " val2: " << d2[i] << std::endl;
pass = 0;
break;}
return pass;
}
void eucl_dist_cpu(const float *centroids, const float *data, float *rdist, int num_centroids, int dim, int num_data, int print){
int out_idx = 0;
float dist, dist_sqrt;
for(int i = 0; i < num_centroids; i++)
for(int j = 0; j < num_data; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
rdist[out_idx++] = dist_sqrt;
if (print) std::cout << dist_sqrt << ", ";
}
if (print) std::cout << std::endl;
}
struct dkeygen : public thrust::unary_function<int, int>
{
int dim;
int numd;
dkeygen(const int _dim, const int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val/dim);
}
};
typedef thrust::tuple<float, float> mytuple;
struct my_dist : public thrust::unary_function<mytuple, float>
{
__host__ __device__ float operator()(const mytuple &my_tuple) const {
float temp = thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple);
return temp*temp;
}
};
struct d_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
d_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % (dim*numd));
}
};
struct c_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
c_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % dim) + (dim * (val/(dim*numd)));
}
};
struct my_sqrt : public thrust::unary_function<float, float>
{
__host__ __device__ float operator()(const float val) const {
return sqrtf(val);
}
};
unsigned long long eucl_dist_thrust(thrust::host_vector<float> ¢roids, thrust::host_vector<float> &data, thrust::host_vector<float> &dist, int num_centroids, int dim, int num_data, int print){
thrust::device_vector<float> d_data = data;
thrust::device_vector<float> d_centr = centroids;
thrust::device_vector<float> values_out(num_centroids*num_data);
unsigned long long compute_time = dtime_usec(0);
thrust::reduce_by_key(thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), dkeygen(dim, num_data)), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(dim*num_data*num_centroids), dkeygen(dim, num_data)),thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_centr.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), c_idx(dim, num_data))), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), d_idx(dim, num_data))))), my_dist()), thrust::make_discard_iterator(), values_out.begin());
thrust::transform(values_out.begin(), values_out.end(), values_out.begin(), my_sqrt());
cudaDeviceSynchronize();
compute_time = dtime_usec(compute_time);
if (print){
thrust::copy(values_out.begin(), values_out.end(), std::ostream_iterator<float>(std::cout, ", "));
std::cout << std::endl;
}
thrust::copy(values_out.begin(), values_out.end(), dist.begin());
return compute_time;
}
int main(int argc, char *argv[]){
int dim = 8;
int num_centroids = 2;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
int num_data = 8;
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114,
0.721, 0.555, 0.979, 0.412, 0.007, 0.501, 0.844, 0.234
};
std::cout << "cpu results: " << std::endl;
float dist[num_data*num_centroids];
eucl_dist_cpu(centroids, data, dist, num_centroids, dim, num_data, 1);
thrust::host_vector<float> h_data(data, data + (sizeof(data)/sizeof(float)));
thrust::host_vector<float> h_centr(centroids, centroids + (sizeof(centroids)/sizeof(float)));
thrust::host_vector<float> h_dist(num_centroids*num_data);
std::cout << "gpu results: " << std::endl;
eucl_dist_thrust(h_centr, h_data, h_dist, num_centroids, dim, num_data, 1);
float *data2, *centroids2, *dist2;
num_centroids = 10;
num_data = 1000000;
if (argc > 2) {
num_centroids = atoi(argv[1]);
num_data = atoi(argv[2]);
if ((num_centroids < 1) || (num_centroids > MAX_CENT)) {std::cout << "Num centroids out of range" << std::endl; return 1;}
if ((num_data < 1) || (num_data > MAX_DATA)) {std::cout << "Num data out of range" << std::endl; return 1;}
if (num_data * dim * num_centroids > 2000000000) {std::cout << "data set out of range" << std::endl; return 1;}}
std::cout << "Num Data: " << num_data << std::endl;
std::cout << "Num Cent: " << num_centroids << std::endl;
std::cout << "result size: " << ((num_data*num_centroids*4)/1048576) << " Mbytes" << std::endl;
data2 = new float[dim*num_data];
centroids2 = new float[dim*num_centroids];
dist2 = new float[num_data*num_centroids];
for (int i = 0; i < dim*num_data; i++) data2[i] = rand()/(float)RAND_MAX;
for (int i = 0; i < dim*num_centroids; i++) centroids2[i] = rand()/(float)RAND_MAX;
unsigned long long dtime = dtime_usec(0);
eucl_dist_cpu(centroids2, data2, dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "cpu time: " << dtime/(float)USECPSEC << "s" << std::endl;
thrust::host_vector<float> h_data2(data2, data2 + (dim*num_data));
thrust::host_vector<float> h_centr2(centroids2, centroids2 + (dim*num_centroids));
thrust::host_vector<float> h_dist2(num_data*num_centroids);
dtime = dtime_usec(0);
unsigned long long ctime = eucl_dist_thrust(h_centr2, h_data2, h_dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "gpu total time: " << dtime/(float)USECPSEC << "s, gpu compute time: " << ctime/(float)USECPSEC << "s" << std::endl;
if (!verify(dist2, &(h_dist2[0]), num_data*num_centroids)) {std::cout << "Verification failure." << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
Notes:
The code is set up to do 2 passes, a short one using a data set similar to yours, with printout for visual check. Then a larger data set can be entered, via command-line sizing parameters (number of centroids, then number of individuals), for benchmark comparison and validation of results.
Contrary to what I stated in the comments, the thrust code is only running about 25% faster than the naive single-threaded CPU code. Your mileage may vary.
This is just one way to think about handling it. I have had other ideas, but not enough time to flesh them out.
The data sets can become rather large. The code right now is intended to be limited to data sets where the product of dimension*number_of_centroids*number_of_individuals
is less than 2 billion. However, as you approach even this number, you will need a GPU and CPU that both have a few GB of memory. I briefly explored larger data set sizes. A few code changes would be needed in various places to extend from e.g. int
to unsigned long long
, etc. However I haven't provided that as I am still investigating an issue with that code.
For another, non-thrust-related look at computing euclidean distances on the GPU, you may be interested in this question. If you follow the sequence of optimizations that were made there, it may shed some light on either how this thrust code might be improved, or else how another non-thrust realization could be used.
Sorry I wasn't able to squeeze more performance out.