
I'm trying to use VLFeat's kmeans implementation in C but I'm having a really hard time understanding how it works.

Note: I am using the C API in a C++ program, so any code I post here is C++. Additionally, I am using the Eigen header library, which is where the Matrix data types come from.

The things that are unclear to me from the example and the API are:

  1. What format does the data have to be in? The kmeans library functions appear to require a one-dimensional array, which could be taken from the backing of a matrix. However, does this matrix need to be column major or row major? That is, how does the function know to differentiate between dimensions of data and different data vectors?
  2. How do I actually access the cluster center info? I ran a test where I declared I wanted 5 clusters, but using their example code from the link above, I only get 1 value back.

Code:

int numData = 1000;
int dims = 10;
// Use float data and the L1 distance for clustering
VlKMeans * kmeans = vl_kmeans_new (VL_TYPE_FLOAT,  VlDistanceL1) ;
// Use Lloyd algorithm
vl_kmeans_set_algorithm (kmeans, VlKMeansLloyd) ;
// Initialize the cluster centers by randomly sampling the data
Matrix<float, 1000,10, RowMajor> data = buildData(numData, dims);
vl_kmeans_init_centers_with_rand_data (kmeans, data.data(), dims, numData, 5);
// Run at most 100 iterations of cluster refinement using Lloyd algorithm
vl_kmeans_set_max_num_iterations (kmeans, 100) ;
vl_kmeans_refine_centers (kmeans, &data, numData) ;
// Obtain the energy of the solution
energy = vl_kmeans_get_energy(kmeans) ;
// Obtain the cluster centers
centers = (double*)vl_kmeans_get_centers(kmeans);
cout << *centers << endl;

Example Output: centers = 0.0376879 (a scalar)

How do I get all centers? I tried using an array to store centers, but it won't accept the type.

I also tried the following, assuming that perhaps I was just accessing the center info wrong:

cout << centers[0]<< endl;
cout << centers[1]<< endl;
cout << centers[2]<< endl;
cout << centers[3]<< endl;
cout << centers[4]<< endl;
cout << centers[5]<< endl;
cout << centers[6]<< endl;
cout << centers[7]<< endl;
cout << centers[8]<< endl;

But I should only have non-zero values for indices 0-4 (given 5 cluster centers). I actually expected exceptions to be thrown for higher indices. If this is the right approach, could someone please explain where these other values (indices 5-8) come from?

I'm sure there are other confusing pieces as well, but I haven't even addressed them yet as I've been stuck on these two pretty important pieces (I mean what is kmeans if you can't cluster properly to start).

Thank you in advance for your help!

marcman

1 Answer


What format does the data have to be in?

The manual says:

All algorithms support float or double data and can use the l1 or the l2 distance for clustering.

You specify both when you create your kmeans handle, e.g.:

VlKMeans *kmeans = vl_kmeans_new(VL_TYPE_FLOAT, VlDistanceL2);

does this matrix need to be column major or row major?

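It must be row major with one data point per row, i.e. the coordinates of each point are stored contiguously and data + dimension * i is the i-th data point. The centers returned by the library use the same layout.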
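To make the layout concrete, here is a minimal sketch that does not call VLFeat at all. The helper names `flattenPoints` and `pointAt` are hypothetical and exist only for illustration; they show what "one point after another in a flat buffer" means in terms of pointer arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a pointer to the first coordinate of point `i` in a flat buffer
// where the `dimension` coordinates of each point are stored contiguously.
inline const float* pointAt(const float* data, std::size_t dimension, std::size_t i) {
    return data + dimension * i;
}

// Packs a list of points into one flat, point-by-point buffer
// (hypothetical helper, not part of any library).
inline std::vector<float> flattenPoints(const std::vector<std::vector<float>>& points,
                                        std::size_t dimension) {
    std::vector<float> flat;
    flat.reserve(points.size() * dimension);
    for (const auto& p : points) {
        assert(p.size() == dimension);  // every point must have the same dimension
        flat.insert(flat.end(), p.begin(), p.end());
    }
    return flat;
}
```

An Eigen matrix declared RowMajor with one data point per row already has this layout, so passing its data() pointer should be equivalent to passing the flat buffer built here.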

How do I actually access the cluster center info?

With vl_kmeans_get_centers. For example, if you work with floats:

/* no need to cast in C since get centers returns a `void` pointer;
   C++ requires an explicit cast to `const float *` */
const float *centers = vl_kmeans_get_centers(kmeans);

(see this answer regarding the cast)

The total size (in bytes) of this array is sizeof(float) * dimension * numCenters. If you want to print out the centers you can do:

int i, j;
for (i = 0; i < numCenters; i++) {
  printf("center # %d:\n", i);
  for (j = 0; j < dimension; j++) {
    printf("    coord[%d] = %f\n", j, centers[dimension * i + j]);
  }
}
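Since the question uses C++, the same idea can be sketched there as follows. This is only an illustration under stated assumptions: `unpackCenters` is a hypothetical helper, and its `raw` parameter stands in for the untyped pointer obtained from the library, so the snippet compiles without VLFeat.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies a flat buffer of numCenters * dimension floats into one vector per center.
// `raw` stands in for the untyped pointer returned by the library (assumption).
inline std::vector<std::vector<float>> unpackCenters(const void* raw,
                                                     std::size_t numCenters,
                                                     std::size_t dimension) {
    assert(raw != nullptr);
    // C++ needs an explicit cast from const void*. The element type must match
    // the type the clustering was created with (float here, not double).
    const float* centers = static_cast<const float*>(raw);
    std::vector<std::vector<float>> out;
    out.reserve(numCenters);
    for (std::size_t i = 0; i < numCenters; ++i) {
        const float* row = centers + dimension * i;  // start of the i-th center
        out.emplace_back(row, row + dimension);      // copy its coordinates
    }
    return out;
}
```

With 5 centers of dimension 10, the buffer holds 50 floats, so indices 0-9 are the coordinates of the first center rather than ten separate centers. Reading past the end of the buffer, or reading float data through a double pointer, yields garbage values instead of an exception.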
deltheil