I want to run some experiments on semi-supervised (constrained) clustering, in particular with background knowledge provided as instance-level pairwise constraints (must-link or cannot-link constraints). Are there any good open-source packages that implement semi-supervised clustering? I looked at PyBrain, mlpy, scikit-learn and Orange, and I couldn't find any constrained clustering algorithms. In particular, I'm interested in constrained K-Means or constrained density-based clustering algorithms (like C-DBSCAN). Packages in Matlab, Python, Java or C++ would be preferred, but other languages are fine too.
-
You may want to have a look at ELKI. It has tons of clustering algorithms, but I don't recall seeing constrained clustering in there. Do you have any non-synthetic data sets for this? I always have the impression that this is a purely academic thing. C-DBSCAN might be easy to implement on top of ELKI's "GeneralizedDBSCAN". – Has QUIT--Anony-Mousse Jan 22 '14 at 09:25
-
I'll look into ELKI code, but a first glance suggests that I'll have to build C-DBSCAN on top of the 'GeneralizedDBSCAN' class. And you're correct, I don't have any non-synthetic data sets for this. And this is purely for academic interest. :) – user1271286 Jan 27 '14 at 06:27
-
Even for academic interest, it should be applicable to real data. There are too many algorithms already that only work with synthetic Gaussian distributions, probably because that is all the authors ever worked on... – Has QUIT--Anony-Mousse Jan 27 '14 at 08:15
6 Answers
The Python package scikit-learn now has algorithms for Ward hierarchical clustering (since 0.15) and agglomerative clustering (since 0.14) that support connectivity constraints.
Besides, I do have a real-world application, namely the identification of tracks from cell positions, where each track can only contain one position from each time point.
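For concreteness, here is a small sketch (not part of the original answer) of what connectivity-constrained Ward clustering looks like in scikit-learn; the toy data and the k-nearest-neighbour graph are purely illustrative:

# Minimal sketch: Ward agglomerative clustering with a connectivity constraint.
# The random data and the 5-nearest-neighbour graph are illustrative only.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(100, 2)

# Connectivity matrix: each sample may only be merged with its 5 nearest neighbours.
connectivity = kneighbors_graph(X, n_neighbors=5, include_self=False)

model = AgglomerativeClustering(n_clusters=4, linkage="ward",
                                connectivity=connectivity)
labels = model.fit_predict(X)
print(labels[:10])

Note that connectivity constraints restrict which samples may be merged, which is weaker than pairwise must-link/cannot-link constraints, but it is what scikit-learn supports out of the box.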


The R package conclust implements a number of algorithms:
There are 4 main functions in this package: ckmeans(), lcvqe(), mpckm() and ccls(). They take an unlabeled dataset and two lists of must-link and cannot-link constraints as input and produce a clustering as output.
There's also an implementation of COP-KMeans in Python (see the sketch below).
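As a rough guide, usage of the linked Python COP-KMeans implementation looks roughly like the following; the import path and the cop_kmeans(dataset, k, ml, cl) signature are assumptions based on that repository, so check its README before relying on them:

# Hedged sketch: the import path and function signature are assumed from the
# linked COP-KMeans repository -- verify against its README.
import numpy as np
from copkmeans.cop_kmeans import cop_kmeans  # assumed import path

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)
must_link = [(0, 1)]    # points 0 and 1 must end up in the same cluster
cannot_link = [(1, 3)]  # points 1 and 3 must end up in different clusters

clusters, centers = cop_kmeans(dataset=X, k=2, ml=must_link, cl=cannot_link)
print(clusters)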

Maybe it's a bit late, but have a look at the following:
An extension of Weka (in java) that implements PKM, MKM and PKMKM
Gaussian mixture model using EM and constraints in Matlab
I hope that this helps.

Full disclosure. I am the author of k-means-constrained.
Here is a Python implementation of K-Means clustering where you can specify the minimum and maximum cluster sizes. It uses the same API as scikit-learn, so it is fairly easy to use. It is also built on a fast C++ package, so it has good performance.
You can pip install it:
pip install k-means-constrained
Example use:
>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,    # number of clusters
...     size_min=2,      # every cluster must contain at least 2 points
...     size_max=5,      # and at most 5 points
...     random_state=0
... )
>>> clf.fit_predict(X)   # fit() returns the estimator; fit_predict() returns the labels
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)

The GitHub package semisupervised has a usage similar to the scikit-learn API.
pip install semisupervised
Step 1. Label the unlabeled samples as -1.
Step 2. model.fit(X, y)
Step 3. model.predict(X_test)
Example:
import numpy as np
from sklearn import metrics
from semisupervised.TSVM import S3VM

model = S3VM()
# Unlabeled samples carry the label -1, so unlabel_y is an array of -1s.
model.fit(np.vstack((label_X_train, unlabel_X_train)),
          np.append(label_y_train, unlabel_y))
# predict
predict = model.predict(X_test)
acc = metrics.accuracy_score(y_test, predict)
# metric
print("accuracy", acc)

-
How can I extend this to a multiclass problem for image classification? – Ranji Raj May 27 '21 at 12:51
Check out the Python package active-semi-supervised-clustering.
Github https://github.com/datamole-ai/active-semi-supervised-clustering
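For reference, a hedged usage sketch based on that repository's README; the import path and the fit(X, ml=..., cl=...) signature are assumptions to verify against the package documentation:

# Hedged sketch: import path and fit(X, ml=..., cl=...) signature are assumed
# from the repository README -- verify before relying on them.
import numpy as np
from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans

X = np.random.RandomState(0).rand(50, 2)
ml = [(0, 1), (2, 3)]   # must-link pairs (indices into X)
cl = [(0, 4)]           # cannot-link pairs

clusterer = PCKMeans(n_clusters=3)
clusterer.fit(X, ml=ml, cl=cl)
print(clusterer.labels_)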
