
My program needs to index unstructured documents with Lucene (4.10); their contents can be anything. My custom Analyzer therefore uses the ClassicTokenizer to tokenize the documents first.
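For reference, the analyzer is wired up roughly like this (a sketch rather than my exact code; MyAnalyzer is a placeholder name, the filter chain after the tokenizer may differ, and I am assuming Lucene 4.10's Version-less constructors):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;

public final class MyAnalyzer extends Analyzer { // placeholder name
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new ClassicTokenizer(reader); // does the initial splitting
        TokenStream result = new ClassicFilter(source);  // strips trailing dots, 's, etc.
        return new TokenStreamComponents(source, result);
    }
}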

Yet it does not completely fit my needs because, for example, I want to be able to search for parts of an email address or parts of a serial number (which could just as well be a telephone number or anything else containing digits) that may be written as 1234.5678.9012 or 1234-5678-9012, depending on who wrote the document being indexed.

Since the ClassicTokenizer recognizes email addresses and treats points followed by numbers as part of a single token, the generated index contains email addresses and serial numbers as whole tokens, whereas I would also like to break those tokens into pieces so that the user can later search for those pieces.

Let me give a concrete example: if the input document features xyz@gmail.com, the ClassicTokenizer recognizes it as an email address and consequently keeps it as the single token xyz@gmail.com. If the user searches for xyz they will find nothing, whereas a search for xyz@gmail.com will yield the expected result.
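This is easy to verify by printing the tokens an analyzer produces. Here is a minimal sketch (using the stock ClassicAnalyzer as a stand-in for my custom analyzer, again assuming Lucene 4.10's Version-less constructors), which also shows the reset()/incrementToken()/end() contract a consumer follows:

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowTokens {
    public static void main(String[] args) throws IOException {
        try (TokenStream ts = new ClassicAnalyzer().tokenStream("field", "xyz@gmail.com")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                   // mandatory before the first incrementToken()
            while (ts.incrementToken()) { // one call per token
                System.out.println(term.toString());
            }
            ts.end();                     // mandatory after the last incrementToken()
        }
        // Prints a single token: xyz@gmail.com
    }
}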

After reading lots of blog posts and SO questions I came to the conclusion that one solution could be to use a TokenFilter that splits the email into its pieces (on each side of the @ sign). Please note that I don't want to create my own tokenizer with JFlex and co.

Dealing with email first, I wrote the following code, inspired by the SynonymFilter from Lucene in Action, 2nd Edition:

import java.io.IOException;
import java.util.Arrays;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class SymbolSplitterFilter extends TokenFilter {

    private final CharTermAttribute termAtt;
    private final PositionIncrementAttribute posIncAtt;
    private final Stack<String> termStack;
    private AttributeSource.State current;

    public SymbolSplitterFilter(TokenStream in) {
        super(in);
        termStack = new Stack<>();
        termAtt = addAttribute(CharTermAttribute.class);
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }

        final String currentTerm = termAtt.toString();

        System.err.println("The original word was " + termAtt.toString());
        final int bufferLength = termAtt.length();

        if (bufferLength > 1 && currentTerm.indexOf("@") > 0) { // There must be something more than just @
            // If this is the first pass we fill the stack with the terms
            if (termStack.isEmpty()) {
                // We split the token abc@cd.com into abc and cd.com
                termStack.addAll(Arrays.asList(currentTerm.split("@")));
                // Now we have the constituent terms of the email on the stack
                System.err.println("The terms on the stack are ");
                for (int i = 0; i < termStack.size(); i++) {
                    System.err.println(termStack.get(i));
                    /* The terms on the stack are
                     * xyz
                     * gmail.com
                     */
                }

                // I am not sure this is the right place for this.
                current = captureState();

            } else {
                // This part seems to never be reached!
                // We add the constituent terms as tokens.
                String part = termStack.pop();
                System.err.println("Current part is " + part);
                restoreState(current);
                termAtt.setEmpty().append(part);
                posIncAtt.setPositionIncrement(0);
            }
        }

        System.err.println("In the end we have " + termAtt.toString());
        // In the end we have xyz@gmail.com
        return true;
    }
}

Please note: I started with emails only, which is why I only showed that part of the code, but I will have to enhance it to also manage serial numbers (as explained earlier).

However, the stack is never processed. Indeed, although I read this SO question, I can't figure out how the incrementToken method works or when it processes a given token from the TokenStream.

Finally, the goal I want to achieve is: for xyz@gmail.com as input text, I want to generate the following subtokens: xyz@gmail.com, xyz and gmail.com.

Any help appreciated,

HelloWorld

1 Answer


Your problem is that the input TokenStream is already exhausted by the time your stack has been filled for the first time, so input.incrementToken() returns false. You should check whether the stack is filled before incrementing the input. Like so:

import java.io.IOException;
import java.util.Arrays;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;

public final class SymbolSplitterFilter extends TokenFilter {

    private final CharTermAttribute termAtt;
    private final PositionIncrementAttribute posIncAtt;
    private final Stack<String> termStack;
    private AttributeSource.State current;
    private final TypeAttribute typeAtt;

    public SymbolSplitterFilter(TokenStream in) {
        super(in);
        termStack = new Stack<>();
        termAtt = addAttribute(CharTermAttribute.class);
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class); // registered, although not modified here
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!this.termStack.isEmpty()) {
            // Emit pending parts first, before asking the input for the next token.
            String part = termStack.pop();
            restoreState(current);
            termAtt.setEmpty().append(part);
            posIncAtt.setPositionIncrement(0); // same position as the original token
            return true;
        } else if (!input.incrementToken()) {
            return false;
        } else {
            final String currentTerm = termAtt.toString();
            final int bufferLength = termAtt.length();

            if (bufferLength > 1 && currentTerm.indexOf("@") > 0) { // There must be something more than just @
                if (termStack.isEmpty()) {
                    termStack.addAll(Arrays.asList(currentTerm.split("@")));
                    current = captureState(); // remember the original token's attributes
                }
            }
            return true;
        }
    }
}
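To also cover the serial numbers mentioned in the question, the stack-filling step could be generalized. A sketch under my own assumptions (the SERIAL pattern and the splitToken helper are hypothetical names, not from your code):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class TokenSplitting { // hypothetical holder for the splitting rule
    // digits separated by '.' or '-', e.g. 1234.5678.9012 or 1234-5678-9012
    private static final Pattern SERIAL = Pattern.compile("\\d+(?:[.-]\\d+)+");
    private static final Pattern SERIAL_SEPARATORS = Pattern.compile("[.-]");

    static List<String> splitToken(String term) {
        if (term.indexOf('@') > 0) {
            // email-like: keep the local part and the domain as the pieces
            return Arrays.asList(term.split("@"));
        }
        if (SERIAL.matcher(term).matches()) {
            // serial-like: both separator styles yield 1234, 5678, 9012
            return Arrays.asList(SERIAL_SEPARATORS.split(term));
        }
        return Collections.emptyList(); // nothing to split
    }
}

The filter would then call termStack.addAll(splitToken(currentTerm)) and capture state only when the returned list is non-empty.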

Note that you might also want to correct your offsets and change the order of your tokens, as the following test shows the resulting tokens:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;
import org.junit.Test;

public class SymbolSplitterFilterTest extends BaseTokenStreamTestCase {

    @Test
    public void testSomeMethod() throws IOException {
        Analyzer analyzer = this.getAnalyzer();
        assertAnalyzesTo(analyzer, "hey xyz@example.com",
            new String[]{"hey", "xyz@example.com", "example.com", "xyz"}, // terms
            new int[]{0, 4, 4, 4},                                        // start offsets
            new int[]{3, 19, 19, 19},                                     // end offsets
            new String[]{"word", "word", "word", "word"},                 // types
            new int[]{1, 1, 0, 0}                                         // position increments
        );
    }

    private Analyzer getAnalyzer() {
        return new Analyzer() {
            @Override
            protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
                SymbolSplitterFilter testFilter = new SymbolSplitterFilter(tokenizer);
                return new Analyzer.TokenStreamComponents(tokenizer, testFilter);
            }
        };
    }
}
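Regarding the offset correction: one possibility (a sketch, assuming an offsetAtt field registered in the constructor via addAttribute(OffsetAttribute.class); it is not part of the code above) is to shift each part by the start offset of the original token:

// Inside the branch of incrementToken() that emits a part from the stack:
String part = termStack.pop();
restoreState(current);                                  // restores the original token, offsets included
final int tokenStart = offsetAtt.startOffset();         // where the whole token starts in the input text
final int partStart = termAtt.toString().indexOf(part); // where the part sits inside the token
termAtt.setEmpty().append(part);
posIncAtt.setPositionIncrement(0);
offsetAtt.setOffset(tokenStart + partStart,
        tokenStart + partStart + part.length());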
Konrad Lötzsch
  • Thanks Konrad, that makes complete sense. Actually I was surprised by the fact that in the Lucene book they increment the input at the end, whereas in other snippets found on the web the increment is done first. I'll read all of that again in the light you shed! I'll keep you posted! – HelloWorld Jun 06 '17 at 20:53
  • I corrected the offsets as you advised, but I am not sure it is effective. I added `offsetAtt = addAttribute(OffsetAttribute.class);` and then, after restoring the original token's state: `final String copyOriginalToken = termAtt.toString();` `termAtt.setEmpty().append(part);` `posIncAtt.setPositionIncrement(0);` `final int partStart = copyOriginalToken.indexOf(part);` `final int partEnd = partStart + part.length();` `offsetAtt.setOffset(partStart, partEnd);` But highlighting does not work! – HelloWorld Jun 07 '17 at 07:46
  • My guess is that it should be `final int partStart = copyOriginalToken.indexOf(part) + offsetAtt.startOffset();` – Konrad Lötzsch Jun 07 '17 at 07:52
  • Thanks for your answer Konrad, although it is not enough. Highlighting works for xyz@gmail.com but still does not work for xyz or gmail.com. – HelloWorld Jun 07 '17 at 10:22
  • Actually you were fully right, Konrad. I had just forgotten to use the same analyzer (involving the new token filter) at search time. Using your tips at both index and search time achieved my goal! Thanks a ton again! – HelloWorld Jun 10 '17 at 18:45