I want to run some experiments on semi-supervised (constrained) clustering, in particular with background knowledge provided as instance-level pairwise constraints (must-link or cannot-link constraints). Are there any good open-source packages that implement semi-supervised clustering? I looked at PyBrain, mlpy, scikit-learn and Orange, and I couldn't find any constrained clustering algorithms. In particular, I'm interested in constrained K-Means or constrained density-based clustering algorithms (like C-DBSCAN). Packages in Matlab, Python, Java or C++ would be preferred, but other languages are fine too.
-
You may want to have a look at ELKI. It has tons of clustering algorithms, but I don't recall seeing constrained clustering in there. Do you have any non-synthetic data sets for this? I always have the impression that this is a purely academic thing. C-DBSCAN might be easy to implement on top of ELKI's "GeneralizedDBSCAN". – Has QUIT--Anony-Mousse Jan 22 '14 at 09:25
-
I'll look into ELKI code, but a first glance suggests that I'll have to build C-DBSCAN on top of the 'GeneralizedDBSCAN' class. And you're correct, I don't have any non-synthetic data sets for this. And this is purely for academic interest. :) – user1271286 Jan 27 '14 at 06:27
-
Even for academic interest, it should be applicable to real data. There are too many algorithms already that only work with synthetic Gaussian distributions, probably because that is all the authors ever worked on... – Has QUIT--Anony-Mousse Jan 27 '14 at 08:15
6 Answers
The Python package scikit-learn now has algorithms for Ward hierarchical clustering (since 0.15) and agglomerative clustering (since 0.14) that support connectivity constraints.
Besides, I do have a real-world application, namely the identification of tracks from cell positions, where each track can only contain one position from each time point.
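For concreteness, here is a small sketch (not part of the original answer) of what connectivity-constrained Ward clustering looks like in scikit-learn; the toy data and the k-nearest-neighbour graph are purely illustrative:

# Minimal sketch: Ward agglomerative clustering with a connectivity constraint.
# The random data and the 5-nearest-neighbour graph are illustrative only.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(100, 2)

# Connectivity matrix: each sample may only be merged with its 5 nearest neighbours.
connectivity = kneighbors_graph(X, n_neighbors=5, include_self=False)

model = AgglomerativeClustering(n_clusters=4, linkage="ward",
                                connectivity=connectivity)
labels = model.fit_predict(X)
print(labels[:10])

Note that connectivity constraints restrict which samples may be merged, which is weaker than pairwise must-link/cannot-link constraints, but it is what scikit-learn supports out of the box.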


The R package conclust implements a number of algorithms:
There are 4 main functions in this package: ckmeans(), lcvqe(), mpckm() and ccls(). They take an unlabeled dataset and two lists of must-link and cannot-link constraints as input and produce a clustering as output.
There's also an implementation of COP-KMeans in Python (see the sketch below).
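As a rough guide, usage of the linked Python COP-KMeans implementation looks roughly like the following; the import path and the cop_kmeans(dataset, k, ml, cl) signature are assumptions based on that repository, so check its README before relying on them:

# Hedged sketch: the import path and function signature are assumed from the
# linked COP-KMeans repository -- verify against its README.
import numpy as np
from copkmeans.cop_kmeans import cop_kmeans  # assumed import path

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)
must_link = [(0, 1)]    # points 0 and 1 must end up in the same cluster
cannot_link = [(1, 3)]  # points 1 and 3 must end up in different clusters

clusters, centers = cop_kmeans(dataset=X, k=2, ml=must_link, cl=cannot_link)
print(clusters)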

Maybe it's a bit late, but have a look at the following:
An extension of Weka (in java) that implements PKM, MKM and PKMKM
Gaussian mixture model using EM and constraints in Matlab
I hope that this helps.

Full disclosure. I am the author of k-means-constrained.
Here is a Python implementation of K-Means clustering where you can specify the minimum and maximum cluster sizes. It uses the same API as scikit-learn, so it is fairly easy to use. It is also built on a fast C++ package, so it has good performance.
You can pip install it:
pip install k-means-constrained
Example use:
>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,    # number of clusters
...     size_min=2,      # every cluster must contain at least 2 points
...     size_max=5,      # and at most 5 points
...     random_state=0
... )
>>> clf.fit_predict(X)   # fit() returns the estimator; fit_predict() returns the labels
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)

The GitHub package semisupervised has a usage similar to the scikit-learn API.
pip install semisupervised
Step 1. Label the unlabeled samples as -1.
Step 2. model.fit(X, y)
Step 3. model.predict(X_test)
Example:
import numpy as np
from sklearn import metrics
from semisupervised.TSVM import S3VM

model = S3VM()
# Unlabeled samples carry the label -1, so unlabel_y is an array of -1s.
model.fit(np.vstack((label_X_train, unlabel_X_train)),
          np.append(label_y_train, unlabel_y))
# predict
predict = model.predict(X_test)
acc = metrics.accuracy_score(y_test, predict)
# metric
print("accuracy", acc)

-
How can I extend this to a multiclass problem for image classification? – Ranji Raj May 27 '21 at 12:51
Check out the Python package active-semi-supervised-clustering.
Github https://github.com/datamole-ai/active-semi-supervised-clustering
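For reference, a hedged usage sketch based on that repository's README; the import path and the fit(X, ml=..., cl=...) signature are assumptions to verify against the package documentation:

# Hedged sketch: import path and fit(X, ml=..., cl=...) signature are assumed
# from the repository README -- verify before relying on them.
import numpy as np
from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans

X = np.random.RandomState(0).rand(50, 2)
ml = [(0, 1), (2, 3)]   # must-link pairs (indices into X)
cl = [(0, 4)]           # cannot-link pairs

clusterer = PCKMeans(n_clusters=3)
clusterer.fit(X, ml=ml, cl=cl)
print(clusterer.labels_)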
