Kmeans clustering acceleration in GPU(CUDA)

Question

I am a fairly new cuda user. I'm practicing on my first cuda application where I try to accelerate kmeans algorithm by using GPU(GTX 670).

Briefly, each thread works on a single point which is compared to all cluster centers and a point is assigned to a center with minimum distance(kernel code can be seen below with comments).

According to Nsight Visual Studio, I have an occupancy of 99.61%(1024 blocks, 1024 threads per block), 99.34% Streaming Multiprocessor activity, 79.98% warp issue efficiency, no shared memory bank conflicts, 18.4GFLOPs Single MUL and 55.2 GFLOPs Single ADD(takes about 14,5 ms to complete kmeans kernel with given parameters).

According to Wikipedia, GTX670's peak performance is 2460 GFLOPs. I am nowhere close to it. In addition to these, some papers claim they can achieve more than half of the peak performance. I cannot see how further I can optimize this kernel code. Is there any optimization that I can apply to the kernel? Any suggestion or help is appreciated and I can give any additional information on demand.

Complete Code

Thanks in advance.

#define SIZE 1024*1024 //number of points
#define CENTERS 32     //number of cluster centroids
#define DIM 8          //dimension of each point and center
#define cudaTHREADSIZE 1024 //threads per block
#define cudaBLOCKSIZE SIZE/cudaTHREADSIZE //number of blocks for kernel

__global__ void kMeans(float *dp, float *dc,int *tag, int *membershipChangedPerBlock)
{
    //TOTAL NUMBER OF THREADS SHOULD BE EQUAL TO THE NUMBER OF POINTS, BECAUSE EACH THREAD WORKS ON A SINGLE POINT
    __shared__ unsigned char  membershipChanged[cudaTHREADSIZE];
    __shared__ float dc_shared[CENTERS*DIM];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threadID = threadIdx.x;

    membershipChanged[threadIdx.x] = 0;
    //move centers to shared memory, because each and every thread will call it(roughly + %10 performance here)
    while(threadID < CENTERS*DIM){
        dc_shared[threadID] = dc[threadID];

        threadID += blockDim.x;
    }
    __syncthreads();

    while(tid < SIZE){
        int   index,prevIndex;
        float dist, min_dist;

        index = 0;//all initial point indices(centroid number) are assigned to 0.
        prevIndex = 0;
        dist = 0;
        min_dist = 0;

        //euclid distance for center 0
        for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
            min_dist += (dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS]);
        }

        //euclid distance for other centers with distance comparison
        for(int centerIdx = 1; centerIdx < CENTERS; centerIdx++){
            dist = 0;
            for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
                dist += (dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS]);    
            }   
            //compare distances, if found a shorter one, change index to that centroid number
            if(dist < min_dist){
                min_dist = dist;
                index = centerIdx;
            }
        }

        if (tag[tid] != index) {//if a point's cluster membership changes, flag it as changed in order to compute total membership changes later on
            membershipChanged[threadIdx.x] = 1;
        }
        tag[tid] = index;

        __syncthreads();//sync before applying sum reduction to membership changes


        //sum reduction
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                membershipChanged[threadIdx.x] +=
                    membershipChanged[threadIdx.x + s];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            membershipChangedPerBlock[blockIdx.x] = membershipChanged[0];
        }
        tid += blockDim.x * gridDim.x;
    }
}

I suggest providing a complete code that someone else could compile and run. — Robert Crovella, Mar 21 '15 at 20:12
You haven't mentioned memory bandwidth anywhere in your question. I would assume the code you posted is global memory bandwidth limited. — talonmies, Mar 22 '15 at 11:26
I looked up memory statistics report with nsight and this is the result http://imgur.com/fa45zfi . GTX670 has 192.256 GB/s global memory bandwidth. According to these results, L2 cache uses 137.63 GB/s bandwidth. Does that mean I am practically using 0.7 of peak bandwidth? — menderft, Mar 22 '15 at 16:04
@talonmies After looking at Issue Efficiency tab of Experiment Results of Nsight nvreport, I found out that 84.55% of issue stall reason was execution dependency. Each thread works on a single point, so that a thread cannot calculate the next dimension distance and sum it with total distance without calculating the previous dimension distance. I am aiming to increase the number of parallel dimension distance calculations by using thread registers and I will sum those semi-total distances in the end. Hope this will increase my FLOPs by using more bandwidth and less execution dependency stalls. — menderft, Mar 25 '15 at 21:03

score 4 · Answer 1 · edited Jun 20 '20 at 09:12

4

My advice is to compare your work with a more exprienced GPU developer's work. I found out Kmeans algorithm is written by Byran Catanzaro after watching this video. You can find the source code:

https://github.com/bryancatanzaro/kmeans

I am also a beginner but IMHO it is better to use libraries like "Trust". GPU programming is really complicated issue it is hard to achieve max performance "Trust" will help you with that.

edited Jun 20 '20 at 09:12

Community

1
1

answered Aug 09 '15 at 18:20

Kadir Erdem Demir

3,531
3
28
39

My aim was to learn optimizing my own code for learning purposes and it worked charms. I always aim to write existing libraries myself and compare their performances. It motivates me whenever I can beat libraries even if it requires some specific inputs :) Teşekkürler. – menderft Aug 10 '15 at 19:26

score 1 · Answer 2 · answered Jan 14 '21 at 15:26

Check out rapids.ai cuml which replicates scikit api

Example from docs:

from cuml import KMeans
from cuml.cluster import KMeans

import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c,column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)

Kmeans clustering acceleration in GPU(CUDA)

2 Answers2