I would like to tune the Carrot2 clusters to avoid labels that are missing their leading preposition. In Russian it looks quite strange to see a word in a non-nominative grammatical case without a preposition in front of it.
The clustering is done using Apache Solr.
Examples:
Минске ([in] Minsk, missing the preposition В at the beginning).
Самом Деле ([in] fact, missing the preposition На at the beginning).
I have tried two independent things:
- edit core/clustering/carrot2/stopwords.ru and remove the prepositions in question from it
- unpack carrot2-mini-3.9.0.jar, remove the entries from its stopwords.ru, and pack it back into the jar.
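For reference, the second attempt amounts to roughly the following. This is only a minimal sketch: it assumes the resource inside the jar is a plain UTF-8 text file with one word per line, and the function name rewrite_stopwords is my own, not part of Carrot2 or Solr. It says nothing about whether Solr actually loads the stop words from that jar.

```python
import zipfile


def rewrite_stopwords(src_jar, dst_jar, words_to_remove, entry_suffix="stopwords.ru"):
    """Copy src_jar to dst_jar, dropping the given words from every entry
    whose name ends with entry_suffix. Returns the names of changed entries.

    Assumption: the stop word file is UTF-8 text, one word per line.
    """
    remove = {w.strip().lower() for w in words_to_remove}
    changed = []
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith(entry_suffix):
                lines = data.decode("utf-8").splitlines()
                # Keep comments and every word that is not in the removal set.
                kept = [ln for ln in lines if ln.strip().lower() not in remove]
                if len(kept) != len(lines):
                    changed.append(info.filename)
                data = ("\n".join(kept) + "\n").encode("utf-8")
            # All other entries are copied through unchanged.
            dst.writestr(info.filename, data)
    return changed
```

For example, rewrite_stopwords("carrot2-mini-3.9.0.jar", "carrot2-mini-patched.jar", ["в", "на"]) would write a patched copy and return the entries it modified. An empty list means the words were not in the file to begin with.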
Neither of these had any effect on the cluster labels. Is there something else obvious to try? Or should I perhaps change the tuning approach altogether?
Thank you!