
I have points with binary features:

id, feature 1, feature 2, ....
1, 0, 1, 0, 1, ...
2, 1, 1, 0, 1, ...

and the size of the matrix is about 20k * 200k, but it is sparse. I am using Mahout to cluster the data with the k-means algorithm, and I have the following questions:

  1. Is k-means a good candidate for binary features?
  2. Is there any way to reduce the dimensionality while keeping the Manhattan distance measure meaningful? (I need Manhattan instead of Cosine or Tanimoto.)
  3. The memory usage of k-means is high: it needs 4 GB of memory per Map/Reduce task (4 MB blocks on a 400 MB vector file, for 3k clusters). Considering that Mahout's Vector object uses double entries, is there any way to use just boolean entries for the points but double entries for the centers?
Masood_mj

2 Answers


k-means is a good candidate if you have a good distance metric. Manhattan distance could be fine; I like log-likelihood.

You can use any dimension-reduction technique you like. I like alternating least squares; the SVD works well too. For a matrix of this size you can do it easily in memory with Commons Math rather than bothering with Hadoop -- that would be way, way overkill.
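
A minimal sketch of that in-memory route with Commons Math (the class name, matrix sizes, and target dimension k below are placeholders I made up; Commons Math materializes a dense copy for the SVD, so this only works while the matrix genuinely fits in memory):

    import org.apache.commons.math3.linear.OpenMapRealMatrix;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.SingularValueDecomposition;

    public class ReduceDims {
        public static void main(String[] args) {
            int rows = 1000, cols = 5000, k = 50; // placeholder sizes
            OpenMapRealMatrix data = new OpenMapRealMatrix(rows, cols);
            // ... data.setEntry(row, col, 1.0) for every 1-bit in the input ...

            // Full SVD, then project the points onto the top-k singular vectors.
            SingularValueDecomposition svd = new SingularValueDecomposition(data);
            RealMatrix reduced = svd.getU()
                    .getSubMatrix(0, rows - 1, 0, k - 1)
                    .multiply(svd.getS().getSubMatrix(0, k - 1, 0, k - 1));
            // Each row of 'reduced' is now a dense k-dimensional point.
        }
    }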

(See also http://myrrix.com -- I have a very fast ALS implementation there that you can reuse in the core/online modules. It can crunch this in a few seconds in tens of MB of heap.)

After the reduction you no longer have binary 0/1 values in your feature matrix. In that feature space, cosine distance should work well (1 - cosineSimilarity); Tanimoto/Jaccard is no longer appropriate.
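
For illustration, a tiny sketch of that distance on the reduced, dense vectors (the helper name is my own):

    // 1 - cosine similarity on the dense reduced vectors.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }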

Sean Owen
  • I am not sure I fully understand the problem at hand, but Jaccard distance is appropriate when it comes to evaluating the similarity between two objects with binary attributes. – Andres Felipe Aug 25 '13 at 00:34

k-means has one big requirement that is often overlooked: it needs to compute a sensible mean. This is much more important than people think.

  • If the mean does not reduce variance, k-means may not converge. (The arithmetic mean is optimal for Euclidean distance; for Manhattan distance, the median is said to be better; for very different metrics, I do not know.)
  • The mean probably won't be sparse anymore
  • The mean won't be a binary vector anymore, either (see the small example after this list)
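
To make this concrete with a made-up two-point example: the mean of the binary vectors (0, 1, 1, 0) and (1, 1, 0, 0) is (0.5, 1, 0.5, 0) -- it has a non-zero entry in every dimension where the points disagree, and it is no longer binary.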

Furthermore, in particular for large data sets, which k do you want to use?

You really should look into other distance measures. Your data set is not that big; a single computer should still suffice. With a compact vector representation it will easily fit into main memory. Just don't use anything that computes an n^2 similarity matrix first. Maybe try something with indexes for binary vector similarity.

k-means is fairly easy to implement, in particular if you don't do any advanced seeding. To reduce memory usage, just implement it yourself on top of the representation that is optimal for your data: it could be a bitset, or a sorted list of the dimensions that are non-zero. Manhattan distance then boils down to counting the number of dimensions in which the vectors differ! (See the sketch below.)
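
A minimal sketch of that distance with java.util.BitSet (the helper name is my own; on binary vectors, Manhattan distance coincides with Hamming distance):

    import java.util.BitSet;

    // Manhattan distance on binary vectors = number of differing dimensions.
    static int manhattanDistance(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone(); // copy so 'a' is not modified
        diff.xor(b);                      // set bits mark the dimensions that differ
        return diff.cardinality();        // count them
    }

The centers would still need fractional entries, but the points themselves stay at one bit per dimension.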

Has QUIT--Anony-Mousse
  • As you said, centroids do not have binary values, so I cannot keep binary values for them. It seems that if I want to treat centroids and the actual data differently, I need my own implementation. Also, I cannot access a machine with tens of gigabytes of memory, so Mahout seems to be the best option for me, as it can show that the approach is scalable. I should try the median method, maybe by changing the Mahout implementation, and see the result. – Masood_mj Jul 12 '12 at 16:54
  • Yes. And have a look at k-medians. The median values are binary again, and they are trivial and cheap to compute for binary data: the median is just the most common value in the cluster. But that is also a risk: that value will probably always be 0! At that point, you end up doing just some random feature selection. So grab another algorithm, nothing from the k-means family. – Has QUIT--Anony-Mousse Jul 12 '12 at 16:57
  • @Anony-Mousse Though I don't disagree with your arguments against k-means, also laid out in this other [stack exchange question](http://stats.stackexchange.com/questions/39023/what-to-use-k-means-or-hierarchical-clustering-for-presence-absence-data), I am curious what you base them on, since k-means aided by Hamming distance and component-wise medians is a popular choice when it comes to clustering binary streams, e.g. Matlab's k-means implementation. – Andres Felipe Aug 25 '13 at 00:52
  • k-medians is a known variant, and it is stable because the median minimizes the absolute deviations. However, for sparse binary data, the median in each dimension will most likely be 0. Not everything that is used, offered, or popular is stable and mathematically well-founded. – Has QUIT--Anony-Mousse Aug 25 '13 at 10:20