How does (carrot) clustering work in solr?

Question

I am running Lucene/Solr 4 to test different features, including clustering. Currently, 1 million documents are indexed. Every document has the following fields:

ID (unique Key) Example1: 10245
               Example2: 24974
TOPIC (Keywords of the document) Example1: "disaster/japan/nuclear power station"
                                 Example2: "world/japan/nuclear power"
HEADLINE (1 line of text): Example1: "explosion at nuclear power plant in japan"
                           Example2: "news about japans nuclear power plant"
TEXT (the full text): "In the Japanese nuclear power plant in Fukushima..."

All fields are indexed and stored, except TEXT, which is only indexed, not stored. I use the following configuration:

    <str name="carrot.title">TOPIC</str>
    <str name="carrot.snippet">HEADLINE</str>

If you look at the examples, you can see that the TOPIC values differ, but "japan" appears in both. Is it possible to configure Solr/Carrot so that Example1 and Example2 end up in one cluster because of the matching "japan"?

Further, there could be a third TOPIC like "news/nuclear power", with no "japan" in it, but whose HEADLINE and TEXT contain the words "japans power plant". Which Solr/Carrot configuration is relevant in order to get all three news items into one cluster?

Thank you!

score 4 · Accepted Answer · answered Jul 14 '11 at 10:38

Carrot2 is designed to cluster natural / unstructured text and such algorithms will very rarely produce results that a human would find perfect. Unfortunately, such algorithms are also hard to "debug" -- the clusters they produce depend on many factors, such as the frequencies with which words occur in your documents. In your specific example, the word Japan may not have been chosen to form a cluster because it's too frequent -- it appears in all of the documents you quoted.

Here are a few tips you may want to try to tweak the clusters:

Try separating keywords with a period followed by a space rather than a slash, e.g. "disaster. japan. nuclear power station". If you do that, Carrot2 will treat word sequences, such as "nuclear power station", as phrases rather than individual words.
Try a different Carrot2 clustering algorithm, e.g. STC.
If there is a chance to get your full story text field stored (or maybe part of it, such as the first paragraph), use the HEADLINE for carrot.title and the full text / excerpt for carrot.snippet.
Play with the specific settings of Carrot2 algorithms. The best tool for this would be Carrot2 Clustering Workbench. Here's how to connect it to Solr: http://wiki.apache.org/solr/ClusteringComponent#Tuning_Carrot2_clustering
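
As a rough sketch of the second and third tips, the relevant parts of solrconfig.xml might look something like the fragment below. It assumes that TEXT (or an excerpt of it) has been re-indexed as a stored field, that the engine label "stc" is just an illustrative name, and that the STC class name matches the Carrot2 version bundled with your Solr release, so check it against your installation before relying on it.

```xml
<!-- Clustering component with an engine that uses STC instead of the default Lingo.
     The engine name "stc" is only a label chosen for this sketch. -->
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">stc</str>
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>

<!-- Parameters to place among the request handler defaults, next to the
     existing carrot.* settings. TEXT must be stored for this to work. -->
<str name="clustering.engine">stc</str>
<str name="carrot.title">HEADLINE</str>
<str name="carrot.snippet">TEXT</str>
```

With this setup the clustering input is the headline plus the story text, so a shared word such as "japan" in HEADLINE or TEXT can contribute to a cluster even when TOPIC does not contain it. Whether the three example documents actually land in one cluster still depends on word frequencies across the whole result set.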

Thank you, there are a lot of interesting ideas, I will try. — The Bndr, Jul 18 '11 at 09:21
