
Is there a way to search for frequent phrases with Lucene?

I'm searching successfully for frequent words:

TermStats[] ts = HighFreqTerms.getHighFreqTerms(reader, 20, fieldName, comparator);

but this brings single words, and I'm looking for a way to search for frequent two (or any number) word combinations.

To clarify, I'm not looking for top two words I know of (for example fast and car) but top two frequent word combinations. So if my text is "this is a fast car and this is also a fast car" I'll get as a result that "fast car" and "this is" are the top two word combinations.
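To illustrate what I mean, the counting itself could be sketched in plain Java without Lucene (class and method names here are just illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: count adjacent word pairs (shingles of size 2) by hand.
public class PairCounter {

    public static Map<String, Integer> pairCounts(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < words.length - 1; i++) {
            String pair = words[i] + " " + words[i + 1];
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                pairCounts("this is a fast car and this is also a fast car");
        // "fast car" and "this is" (and "a fast") each occur twice here
        counts.entrySet().stream()
              .filter(e -> e.getValue() > 1)
              .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

The question is whether Lucene can do this for me over an indexed corpus instead of a single string.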

I looked at the discussion here, but it offers a solution with Solr and I'm looking for something with Lucene; in any case the relevant link is broken.

EDIT: following femtoRgon's comment, here's some code from my Analyzer. Is this where the ShingleFilter should be added? It doesn't seem to work, as my output looks like this:

ed d 
d 
d   p
 p
 p pl  
pl
pl le

What I need is for the output to include pairs of full words.

Here's my createComponents method:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 2, 2);     
    ShingleFilter sf = new ShingleFilter(source, 2, 2);

    TokenStreamComponents tsc = new TokenStreamComponents(source, sf);  
    return tsc;
}

EDIT2: I changed the NGramTokenizer to a StandardTokenizer following femtoRgon's comment, and now I'm getting full words, but I don't need the single words, just the pairs.

This is the code:

Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);        
ShingleFilter sf = new ShingleFilter(source, 2, 2);

Note the 2, 2, which according to the documentation should set the minimum and maximum shingle size to two words. But in fact it generates this output:

and
and other
other
other airborne
airborne
airborne particles

So how do I get rid of the single words and get this output?

and other
other airborne
airborne particles
  • [ShingleFilter](https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html) is definitely still around, and is my first thought, as well. – femtoRgon Jul 30 '15 at 15:59
  • @femtoRgon cool, thanks, I'll give it a try. But how do I then go about getting the most frequent ones? – Eddy Jul 30 '15 at 16:17
  • `HighFreqTerms` should work just fine. You index shingles of the appropriate sizes, which gives you multi-word terms in the index, then you just check for the high frequency ones. – femtoRgon Jul 30 '15 at 16:23
  • @femtoRgon I edited the question to reflect my attempt at using the ShingleFilter. It doesn't seem to work but maybe I'm not doing it correctly. – Eddy Jul 30 '15 at 19:40
  • `NGramTokenizer` is not a good tokenizer to use with `ShingleFilter`. You'll want to use something that separates into words, `StandardTokenizer`, for instance. – femtoRgon Jul 30 '15 at 20:00
  • @femtoRgon ok, cool, this gives me full words. But I still want to get rid of the single words. I edited my question (EDIT2) with the new info. – Eddy Jul 31 '15 at 05:11
  • [setOutputUnigrams](https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html#setOutputUnigrams(boolean)) – femtoRgon Jul 31 '15 at 06:21
  • Cool, thanks. if you write this as an answer I'll happily accept it. – Eddy Jul 31 '15 at 07:28

1 Answer


Here's my full Analyzer class that does the job. Note that the createComponents method is where the ShingleFilter is set up, following femtoRgon's excellent comments on my question. Just put in your own string, specify minWords and maxWords, and run it.

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class RMAnalyzer extends Analyzer {

    public static String someString = "some string";
    private int minWords = 2;
    private int maxWords = 2;

    public static void main(String[] args) {
        RMAnalyzer rma = new RMAnalyzer(2, 2);
        rma.findFrequentTerms();
        rma.close();
    }

    public RMAnalyzer(int minWords, int maxWords) {
        this.minWords = minWords;
        this.maxWords = maxWords;
    }

    public void findFrequentTerms() {
        try {
            TokenStream tokenStream = tokenStream("title", new StringReader(someString));
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();

            while (tokenStream.incrementToken()) {
                System.out.println(charTermAttribute.toString());
            }
            tokenStream.end();
            tokenStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        ShingleFilter sf = new ShingleFilter(source, minWords, maxWords);
        sf.setOutputUnigrams(false); // suppress single-word tokens in the output
        sf.setOutputUnigramsIfNoShingles(true); // still emit single words if the input is too short to shingle

        return new TokenStreamComponents(source, sf);
    }
}
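To then get the most frequent phrases rather than just printing the shingles, one approach (a sketch, assuming lucene-core, lucene-analyzers-common and lucene-misc 4.7 on the classpath; the field name "title" and the helper class/method names are my own) is to index the shingled text and call HighFreqTerms, as femtoRgon suggested:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PhraseFrequencies {

    // Sketch: index one document with the shingle analyzer, then ask
    // HighFreqTerms for the most frequent multi-word terms.
    public static TermStats[] topPhrases(String text, int howMany) throws Exception {
        Directory dir = new RAMDirectory();
        Analyzer analyzer = new RMAnalyzer(2, 2);
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_47, analyzer);
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new TextField("title", text, Field.Store.NO));
            writer.addDocument(doc);
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            return HighFreqTerms.getHighFreqTerms(reader, howMany, "title",
                    new HighFreqTerms.TotalTermFreqComparator());
        }
    }

    public static void main(String[] args) throws Exception {
        for (TermStats stats : topPhrases("this is a fast car and this is also a fast car", 3)) {
            System.out.println(stats.termtext.utf8ToString() + " x " + stats.totalTermFreq);
        }
    }
}
```

Because the index only ever sees the two-word shingles, the terms HighFreqTerms returns are already phrases; with more documents you'd add them all to the same writer before querying.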