Possible Duplicate:
How to get frequently occurring phrases with Lucene

I need to find the most frequently occurring words or word groups in an index; that is, the most frequent text might be a single word or a group of words. It is much like Twitter's trending topics (without hashtag entities, of course). Does Lucene provide a method for this, and if not, how can I achieve it on a massive amount of data? If the question is unclear I can give examples to be more specific. I'm using Java and Lucene 3.5, by the way.

A quick edit: a "word group" can contain at most 3 words. Let's say that in a big text the word "is" appears 500 times, "weather" 100 times, "nice" 300 times, and the word group "weather is nice" 90 times. I need to find out whether the occurrence count of "weather is nice" is significant for me. And of course I need to check every indexed word...

Thank you.

FDem

1 Answer

If you want to find the most frequently occurring sequences of consecutive tokens with a maximum length of 3, the problem can be seen as a search for the most frequent N-grams, as discussed in the question How to get frequently occurring phrases with Lucene.
In your case you probably don't need Solr: see this little code. You just have to count each of the generated N-grams and keep the ones that appear more often than a chosen threshold. Counting these N-grams efficiently is the harder problem. If there aren't too many of them (e.g. fewer than 1~2M distinct N-grams), you can just use a HashMap.
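To make the HashMap approach concrete, here is a minimal sketch. It assumes the text has already been tokenized (for example by a Lucene analyzer) into a list of strings; the class `NGramCounter` and its methods are hypothetical names for illustration and are not part of Lucene or of the code linked above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class: counts word N-grams in memory.
class NGramCounter {

    // Counts every run of 1..maxN consecutive tokens (maxN = 3 in the question).
    static Map<String, Integer> count(List<String> tokens, int maxN) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            // Grow the N-gram one token at a time, stopping at the end of the text.
            for (int n = 1; n <= maxN && start + n <= tokens.size(); n++) {
                if (n > 1) {
                    gram.append(' ');
                }
                gram.append(tokens.get(start + n - 1));
                String key = gram.toString();
                Integer old = counts.get(key);
                counts.put(key, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    // Keeps only the N-grams that appear more than `threshold` times.
    static List<String> frequent(Map<String, Integer> counts, int threshold) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Calling `NGramCounter.count(tokens, 3)` and then `NGramCounter.frequent(counts, threshold)` gives the candidate words and word groups. Memory grows with the number of distinct N-grams, which is why this only works while that number stays small.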
If there are more than that, you can try the interesting count-min sketch algorithm. There's an implementation, but personally I've never used it and don't know how good it is.
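For reference, the idea behind a count-min sketch can be shown in a few lines. This is only a simplified illustration, not the implementation mentioned above: the class name, the constructor parameters and the hash mixing are all assumptions, and a real implementation would use properly independent hash functions.

```java
import java.util.Random;

// Simplified count-min sketch for illustration only (hypothetical class).
class CountMinSketch {
    private final int depth;      // number of rows, one hash function per row
    private final int width;      // number of counters per row
    private final long[][] table;
    private final int[] seeds;

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = random.nextInt();
        }
    }

    // Derives a per-row bucket from String.hashCode(); a shortcut assumed here,
    // so strings with equal hashCode() collide in every row.
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        h *= 0xc2b2ae35;
        h ^= (h >>> 16);
        return (h & 0x7fffffff) % width;
    }

    // Records one occurrence of the item in every row.
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)]++;
        }
    }

    // Returns the smallest counter over all rows: never below the true count,
    // but possibly above it when other items share the same buckets.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```

Memory use is fixed at `depth * width` counters regardless of how many distinct N-grams are seen. The trade-off is that estimates can be too high, so N-grams that pass the threshold may need an exact recount.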

Jacopofar