I have an (imperfectly) clustered string data, where the items in one cluster might look like this:
[
Yellow ripe banana very tasty,
Yellow ripe banana with little dots,
Green apple with little dots,
Green ripe banana - from the market,
Yellow ripe banana,
Nice yellow ripe banana,
Cool yellow ripe banana - my favourite,
Yellow ripe,
Yellow ripe
],
where the optimal title would be 'Yellow ripe banana'.
Currently, I am using simple heuristics - choosing the most common, or the shortest name if tie, - with the help of SQL GROUP BY. My data contains a large amount of such clusters, they change frequently, and, every time a new fruit is added to or removed from the cluster, the title for the cluster has to be re-calculated.
I would like to improve two things:
(1) Efficiency - e.g., compare the new fruit name to the title of the cluster only, and avoid grouping / phrase clustering of all fruit titles each time.
(2) Precision - instead of looking for the most common complete name, I would like to extract the most common phrase. The current algorithm would choose 'Yellow ripe', which repeats 2 times and is the most common complete phrase; however, as the phrase, 'Yellow ripe banana' is the most common in the given set.
I am thinking of using Solr + Carrot2 (got no experience with the second). At this point, I do not need to cluster the documents - they are already clustered based on other parameters - I only need to choose the central phrase as the center/title of the cluster.
Any input is very appreciated, thanks!