I am looking for a way to implement semi-supervised clustering in R, possibly constrained clustering, particularly the "cannot-link" kind (I think, but see below). I found this question, but I don't know these languages.
I have a certain data set (of indices for words) that looks like this:
11 195 752 67 77
243 130
8 2178 2581 201 129 77
3872 8967 282
880 77
65 535 363 282
The first index on each line is the "leaf" on a directed acyclic graph, i.e., a tree/hierarchy. There is more than one word hierarchy/tree in the database; the last item on each line is the root/top node (semantically, the most general/abstract word), so a line represents the path from leaf to root. I suppose what I want to get in the end is essentially dimension reduction: instead of, say, 1000 unique words there would be 200 clusters of words, which can be useful when dealing with limited data (small texts), since it reduces data sparseness in e.g. topic/summarization models.
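For concreteness, this is how the example lines can be read into R as integer paths (a minimal sketch; the variable names are my own):

```r
# Each line is a whitespace-separated path of word indices,
# from leaf (first element) to root (last element).
lines <- c("11 195 752 67 77",
           "243 130",
           "8 2178 2581 201 129 77",
           "3872 8967 282",
           "880 77",
           "65 535 363 282")
paths  <- lapply(strsplit(lines, "\\s+"), as.integer)
leaves <- vapply(paths, function(p) p[1], integer(1))          # first index
roots  <- vapply(paths, function(p) p[length(p)], integer(1))  # last index
```

In a real run the lines would of course come from the database/file rather than a hard-coded vector.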
What I have so far. First, I derive a similarity matrix (based on how many nodes a pair of leaves has in common in the tree, or more precisely, the Wu & Palmer WordNet distance metric), then perform partitioning around medoids (pam) using the cluster
package in R. On the example data, k=3..5 clusters would be OK: with fewer than 3, leaves with no common root would get shoved into the same cluster, which makes no sense since they have nothing in common. Mostly this works. However, with larger datasets (500+ such items), even when I choose an appropriately high k, things that should not be clustered together sometimes end up in the same cluster. Right now my solution is just to detect that (by checking every cluster for a common root node), print a warning, and choose a different k.
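The pipeline on the example data can be sketched like this. Note the similarity here is a crude shared-node overlap, just a stand-in for the real Wu & Palmer metric (which uses the depth of the least common subsumer); everything else matches the setup described above:

```r
library(cluster)  # ships with standard R installations

paths <- list(c(11, 195, 752, 67, 77),
              c(243, 130),
              c(8, 2178, 2581, 201, 129, 77),
              c(3872, 8967, 282),
              c(880, 77),
              c(65, 535, 363, 282))
n <- length(paths)

# Simplified similarity: shared path nodes over combined path length
# (placeholder for Wu & Palmer).
sim <- outer(seq_len(n), seq_len(n), Vectorize(function(i, j)
  2 * length(intersect(paths[[i]], paths[[j]])) /
    (length(paths[[i]]) + length(paths[[j]]))))

diss <- 1 - sim                 # pairs with no common node get distance 1
fit  <- pam(as.dist(diss), k = 3)
fit$clustering                  # cluster label for each leaf
```

With k = 3 this groups the three leaves under root 77 together, the two under root 282 together, and leaves the root-130 item alone, as intended.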
What I need is a way to "tell"/supervise the clustering algorithm that I don't want certain items to be clustered together. That would be easy to express in itself: e.g., when calculating the (dis)similarity matrix, I could maintain a parallel matrix of the same dimensions, and whenever a pair of items with no common root is encountered, put a mark there indicating that fact. However, the pam function I've been using does not seem to accept such constraints. Currently, items with no common root just get assigned 0 similarity (upon converting to a dissimilarity matrix, that becomes the maximum distance present). I tried boosting the distance values of such pairs in the matrix, but then I realized that pulling those items further apart makes the other items "look closer" to the clustering algorithm, again distorting the results.
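The parallel constraint matrix described above is easy to build, and it also gives a clean way to check an existing clustering for violations. A sketch (all names are my own, and the similarity is again a simplified overlap measure):

```r
paths <- list(c(11, 195, 752, 67, 77), c(243, 130),
              c(8, 2178, 2581, 201, 129, 77), c(3872, 8967, 282),
              c(880, 77), c(65, 535, 363, 282))
n <- length(paths)

# TRUE wherever a pair of leaves shares no node at all (different trees),
# i.e. a cannot-link pair.
cannot_link <- outer(seq_len(n), seq_len(n), Vectorize(function(i, j)
  i != j && length(intersect(paths[[i]], paths[[j]])) == 0))

# Report every cannot-link pair that a given clustering puts together
# (each pair listed once, via upper.tri).
violations <- function(clustering, cannot_link) {
  same <- outer(clustering, clustering, `==`)
  which(same & cannot_link & upper.tri(cannot_link), arr.ind = TRUE)
}

ok  <- violations(c(1, 2, 1, 3, 1, 3), cannot_link)  # respects constraints
bad <- violations(c(1, 1, 1, 2, 1, 2), cannot_link)  # leaf 243 joins 77-tree
```

This is essentially a vectorized version of the warning check mentioned earlier, reusable as a feasibility test inside a constrained algorithm.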
After googling, I have come to the conclusion that I either need to implement it myself (I have found no R package that does this out of the box) or I have been looking for the wrong keywords, due to my limited knowledge of machine learning. So for my question (how do I get this working in R?) either kind of answer would be helpful: suggest an R package that does it, or suggest a way to implement it (some example R code would help get me started). Or perhaps I've been going about this all wrong, and k-medoids clustering is not the right solution here?
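One way to roll your own is in the spirit of COP-KMEANS adapted to medoids: keep the medoid selection, but make the assignment step refuse any cluster that already contains a cannot-link partner. A minimal, unoptimized sketch (all function and variable names are my own; a complete solution would also re-select medoids iteratively):

```r
paths <- list(c(11, 195, 752, 67, 77), c(243, 130),
              c(8, 2178, 2581, 201, 129, 77), c(3872, 8967, 282),
              c(880, 77), c(65, 535, 363, 282))
n <- length(paths)
shared <- function(i, j) length(intersect(paths[[i]], paths[[j]]))

# Simplified overlap-based dissimilarity and cannot-link constraints.
diss <- 1 - outer(seq_len(n), seq_len(n), Vectorize(function(i, j)
  2 * shared(i, j) / (length(paths[[i]]) + length(paths[[j]]))))
cannot_link <- outer(seq_len(n), seq_len(n),
                     Vectorize(function(i, j) i != j && shared(i, j) == 0))

# Assign each item to the nearest medoid whose cluster contains none of
# its cannot-link partners; items closest to some medoid are placed first.
constrained_assign <- function(diss, medoids, cannot_link) {
  clustering <- integer(nrow(diss))            # 0 = unassigned
  for (i in order(apply(diss[, medoids, drop = FALSE], 1, min))) {
    for (m in order(diss[i, medoids])) {       # try medoids, nearest first
      members <- which(clustering == m)
      if (!any(cannot_link[i, members])) { clustering[i] <- m; break }
    }
    if (clustering[i] == 0) stop("no feasible cluster for item ", i)
  }
  clustering
}

clustering <- constrained_assign(diss, medoids = c(5, 4, 2), cannot_link)
```

By construction, the result can never merge two leaves from different trees, no matter what the raw distances say; the price is that the greedy order can fail (hence the stop()) when k is too small for the constraints.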
Small update: I found the package flexclust, which indeed seems to do something like this: its function kcca has a parameter groupFun = "differentClusters", which is supposed to implement a cannot-link constraint - but the documentation doesn't say much more about it (and neither does the cited paper). Also, kcca takes a regular data matrix as input, while what I have is a distance matrix (which I construct, cell by cell, by calculating the Wu & Palmer distance for each pair of words, based on the leaf-to-root paths described in the data example above). So I guess that is not the solution here.