Is there a way to search for frequent phrases with Lucene?
I'm searching successfully for frequent words:
TermStats[] ts = HighFreqTerms.getHighFreqTerms(reader, 20, fieldName, comparator);
but this returns single words, and I'm looking for a way to find frequent combinations of two (or any number of) words.
To clarify, I'm not looking for two words I already know (for example, fast and car) but for the most frequent two-word combinations. So if my text is "this is a fast car and this is also a fast car", I'd expect "fast car" and "this is" to come back as the top two word combinations.
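To make the desired result concrete, here is a minimal plain-Java sketch (no Lucene involved) that counts adjacent word pairs in the example sentence. The class and method names (BigramCounter, countBigrams, top) are hypothetical and only illustrate the kind of output I'm after; splitting on whitespace is a simplifying assumption, not what a Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only: counts adjacent word pairs.
final class BigramCounter {

    // Lowercases the text, splits on whitespace, and counts each adjacent pair.
    static Map<String, Integer> countBigrams(String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
        }
        return counts;
    }

    // Returns up to n pairs ordered by descending count.
    // List.sort is stable, so tied pairs keep their first-seen order.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countBigrams("this is a fast car and this is also a fast car");
        for (Map.Entry<String, Integer> e : top(counts, 3)) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

On this sentence "this is", "a fast" and "fast car" each occur twice, so the pair "a fast" ties with the other two unless stop words such as "a" are dropped first.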
I looked at the discussion here, but it offers a solution with Solr and I'm looking for something with Lucene; in any case, the relevant link is broken.
EDIT: Following femtoRgon's comment, here's some code from my Analyzer. Is this where the ShingleFilter should be added? It doesn't seem to work, as my output looks like this:
ed d
d
d p
p
p pl
pl
pl le
What I need is for the output to include pairs of full words.
Here's my createComponents method:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 2, 2);
    ShingleFilter sf = new ShingleFilter(source, 2, 2);
    TokenStreamComponents tsc = new TokenStreamComponents(source, sf);
    return tsc;
}
EDIT2: I changed the NGramTokenizer to StandardTokenizer following femtoRgon's comment, and now I'm getting full words, but I don't need the single words, just the pairs.
This is the code:
Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
ShingleFilter sf = new ShingleFilter(source, 2, 2);
Note the 2, 2 which, according to the documentation, should set both the minimum and the maximum shingle size to 2 words. But in fact it generates this output:
and
and other
other
other airborne
airborne
airborne particles
So how do I get rid of the single words and get this output?
and other
other airborne
airborne particles
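To show the difference between the two outputs, here is a plain-Java sketch that mimics it without Lucene. ShingleSketch and its outputUnigrams parameter are hypothetical names of mine, and this is not how ShingleFilter is implemented. The parameter name is borrowed from ShingleFilter's setOutputUnigrams(boolean) setter, which looks like the relevant switch, but I have not verified that calling sf.setOutputUnigrams(false) produces exactly the second list in my analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
final class ShingleSketch {

    // Emits two-word shingles from whitespace-separated text.
    // When outputUnigrams is true, each single word is emitted as well,
    // just before the pair that starts at that word.
    static List<String> shingles(String text, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out; // nothing to emit for blank input
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (outputUnigrams) {
                out.add(words[i]);
            }
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "and other airborne particles";
        System.out.println(shingles(text, true));  // single words interleaved with pairs
        System.out.println(shingles(text, false)); // pairs only
    }
}
```

With the flag set to false, the sketch yields only the three pairs listed above, which is the output I want from the analyzer.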