Modifying stop words list

Question

I would like to tune the Carrot2 clusters to avoid labels that do not start with a preposition -- in Russian it looks quite strange to see a word in a non-nominative grammatical case without its preposition.

The clustering is done using Apache Solr.

Examples:

Минске ([in] Minsk, missing preposition В in the beginning).
Самом Деле ([in] fact, missing preposition На in the beginning).

I have tried two independent things:

configure core/clustering/carrot2/stopwords.ru -- and remove the prepositions in question from there
unpack carrot2-mini-3.9.0.jar, remove the same entries from stopwords.ru and pack it back into the jar.

None of the above has any effect on the cluster labels. Is there some other obvious thing to try? Or perhaps, change the approach of tuning altogether?

Thank you!

Stanislaw Osinski · Accepted Answer · 2018-10-17T11:15:27.290

Removing prepositions from the stop words files should do the trick. Even with the modified stop words files, the prepositions can still be missing due to the statistics of the data -- if some occurrences of Минске are prefixed with "in" and others are not, the algorithm may pick the shorter version (without the preposition) as the more representative.
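To illustrate the statistics argument, here is a toy sketch that simply counts one- and two-word phrases in a few made-up document titles. It is not how Lingo actually scores labels; the sample titles and the `phrase_counts` helper are assumptions for illustration only. It shows how the bare word can end up more frequent than the version with the preposition.

```python
from collections import Counter

def phrase_counts(docs, max_len=2):
    """Count every word n-gram of length 1..max_len (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Made-up titles: two use "в Минске" (in Minsk), one uses "о Минске" (about Minsk).
docs = [
    "Что посмотреть в Минске",
    "Жизнь в Минске",
    "Новости о Минске",
]

counts = phrase_counts(docs)
# The bare word covers all three titles, the phrase with "в" only two of them.
print(counts["минске"], counts["в минске"])
```

In data like this, a label candidate without the preposition covers more documents, which is one way the shorter form could win even when the preposition is no longer a stop word.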

Stop words in core/clustering/carrot2/stopwords.ru should take precedence over the stop words contained in carrot2-mini-3.9.0.jar.

When it comes to the Lingo clustering algorithm, there's no direct way to control the number of words per label, but you can try increasing phrase label boost and lowering truncated label threshold.

A complete list of clustering algorithm parameters is in Carrot2 documentation. You can pass parameter overrides as part of Solr results clustering requests.
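As a rough sketch of what such an override might look like, the snippet below builds a Solr clustering request URL with one Carrot2 parameter attached. The key `LingoClusteringAlgorithm.phraseLabelBoost` is the form reported to work in the comments on this answer; the host, the core name `mycore`, the `/clustering` handler path, the query and the value `3.0` are all assumptions, so adjust them to your setup.

```python
from urllib.parse import urlencode

def build_clustering_url(base_url, query, overrides):
    """Build a Solr clustering request URL with Carrot2 parameter overrides."""
    params = {
        "q": query,
        "wt": "json",
        "clustering": "true",          # enable the clustering component
        "clustering.results": "true",  # cluster the search results
    }
    # Carrot2 attribute keys are passed as ordinary request parameters.
    params.update(overrides)
    return base_url + "?" + urlencode(params)

url = build_clustering_url(
    "http://localhost:8983/solr/mycore/clustering",  # assumed host, core and handler
    "Минск",
    {"LingoClusteringAlgorithm.phraseLabelBoost": "3.0"},  # arbitrary example value
)
print(url)
```

Other attributes from the Carrot2 documentation can be added to the `overrides` dictionary in the same way, using the keys listed there.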

edited Oct 17 '18 at 11:15

answered Oct 16 '18 at 08:08

Stanislaw Osinski


thank you! Should the first approach with core/clustering/carrot2/stopwords.ru work? I.e. no need to modify the carrot2-mini-3.9.0.jar? The statistics explanation makes total sense. Do you have any tips beyond the Carrot2 documentation for influencing the labels? Can I control the number of words per label? – D_K Oct 16 '18 at 10:19
I've edited the answer to add the extra information. – Stanislaw Osinski Oct 17 '18 at 11:15
thanks Stanislaw! I will test these out and share. Marked as answer for now. – D_K Oct 17 '18 at 14:40
just wanted to clarify parameter passing: should the parameter be prefixed, as in carrot.phraseLabelBoost or clustering.phraseLabelBoost, or passed some other way? Btw, clustering debug sessions could benefit from a statistics output of certain parameters for the result set. – D_K Oct 18 '18 at 04:15
apparently from the documentation, LingoClusteringAlgorithm.phraseLabelBoost is the way to pass to Solr. – D_K Oct 18 '18 at 04:33
using StopWordLabelFilter.enabled=false has made the first label "В Минске" (In Minsk). Could there be a bug in the Solr 5.1.0 based implementation that ignores custom stop words? – D_K Oct 18 '18 at 04:37
Thanks for further tests. I'll double check the custom stop words issue, it's weird indeed. – Stanislaw Osinski Oct 18 '18 at 16:48
