I've had encouraging results clustering a set of entity names using scikit-learn's affinity propagation implementation, with a modified Jaro-Winkler distance as the similarity metric, but my clusters are still too numerous (i.e., too many false positives).
I see in the scikit-learn documentation that there exists a 'preference' parameter that affects the number of clusters, with the following description:
preference : array-like, shape (n_samples,) or float, optional
Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, ie of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.[0]
However, when I began tinkering with this value, I found that a very narrow range of values was giving me either too many clusters (preference=-11.13) or too few clusters (preference=-11.11).
Is there some way to determine what a 'reasonable' value of the preference parameter should be? And why am I unable to obtain an intermediate number of clusters?
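To make the question concrete, here is a minimal sketch of the kind of sweep I have in mind. It is not my actual data or metric: the similarity matrix is a stand-in built from negative squared Euclidean distances on synthetic 2-D points, and `sweep_preferences`, the damping value, and the step count are names and settings I made up for illustration. The sweep range (minimum to median of the off-diagonal similarities) is based on my understanding that the median tends to give a moderate number of clusters and the minimum a small number.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def sweep_preferences(S, n_steps=10, random_state=0):
    """Fit affinity propagation on a precomputed similarity matrix S for
    several preference values and return (preference, n_clusters) pairs.

    The range swept (min to median of the off-diagonal similarities) is a
    heuristic assumption, not something scikit-learn prescribes.
    """
    # Ignore the diagonal: self-similarities are overridden by `preference`.
    off_diag = S[~np.eye(S.shape[0], dtype=bool)]
    lo, hi = off_diag.min(), np.median(off_diag)

    results = []
    for pref in np.linspace(lo, hi, n_steps):
        ap = AffinityPropagation(
            affinity="precomputed",
            preference=pref,
            damping=0.9,      # assumed value; higher damping can reduce oscillation
            max_iter=500,
            random_state=random_state,
        )
        ap.fit(S)
        # An empty result here means the run did not converge.
        results.append((float(pref), len(ap.cluster_centers_indices_)))
    return results


# Synthetic stand-in data: three blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Similarity = negative squared Euclidean distance (larger means more similar).
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

results = sweep_preferences(S)
for pref, k in results:
    print(f"preference={pref:.2f} -> {k} clusters")
```

With my real similarity matrix in place of `S`, a sweep like this is where I see the cluster count jump between the two extremes within a very narrow band of preference values.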
Similar questions: