My program needs to index, with Lucene (4.10), unstructured documents whose content can be anything, so my custom Analyzer uses the ClassicTokenizer as the first tokenization step.
Yet it does not completely fit my needs, because I want to be able to search for parts of an email address or parts of a serial number (which can also be a telephone number or anything else containing digits) that may be written as 1234.5678.9012 or 1234-5678-9012, depending on who wrote the document being indexed.
Since the ClassicTokenizer recognizes email addresses and treats dot-separated numbers as a single token, the generated index contains email addresses and serial numbers as whole tokens, whereas I would also like to break those tokens into pieces so that the user can later search for those pieces.
Let me give a concrete example: if the input document contains xyz@gmail.com, the ClassicTokenizer recognizes it as an email address and consequently emits the single token xyz@gmail.com. If the user searches for xyz they will find nothing, whereas a search for xyz@gmail.com will yield the expected result.
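To double-check that behavior, I printed the tokens the ClassicTokenizer produces for a small sample (a minimal sketch; the sample text is made up):
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Tokenizer tokenizer = new ClassicTokenizer(
        new StringReader("write to xyz@gmail.com about 1234.5678.9012"));
CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();
while (tokenizer.incrementToken()) {
    System.out.println(term.toString());
}
tokenizer.end();
tokenizer.close();
This prints write, to and about as expected, but xyz@gmail.com and 1234.5678.9012 each come out as a single token, which is exactly the behavior described above.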
After reading lots of blog posts and SO questions, I came to the conclusion that one solution could be to use a TokenFilter that splits the email into its pieces (on each side of the @ sign). Please note that I don't want to create my own tokenizer with JFlex and co.
Dealing with emails first, I wrote the following code, inspired by the SynonymFilter from Lucene in Action, 2nd Edition:
import java.io.IOException;
import java.util.Arrays;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class SymbolSplitterFilter extends TokenFilter {
    private final CharTermAttribute termAtt;
    private final PositionIncrementAttribute posIncAtt;
    private final Stack<String> termStack;
    private AttributeSource.State current;

    public SymbolSplitterFilter(TokenStream in) {
        super(in);
        termStack = new Stack<>();
        termAtt = addAttribute(CharTermAttribute.class);
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        final String currentTerm = termAtt.toString();
        System.err.println("The original word was " + currentTerm);
        final int bufferLength = termAtt.length();
        if (bufferLength > 1 && currentTerm.indexOf("@") > 0) { // There must be something more than just @
            // If this is the first pass we fill the stack with the terms
            if (termStack.isEmpty()) {
                // We split the token abc@cd.com into abc and cd.com
                termStack.addAll(Arrays.asList(currentTerm.split("@")));
                // Now we have the constituent terms of the email on the stack
                System.err.println("The terms on the stack are ");
                for (int i = 0; i < termStack.size(); i++) {
                    System.err.println(termStack.get(i));
                    /* The terms on the stack are
                     * xyz
                     * gmail.com
                     */
                }
                // I am not sure this is the right place for this.
                current = captureState();
            } else {
                // This part seems never to be reached!
                // We emit the constituent terms as tokens.
                String part = termStack.pop();
                System.err.println("Current part is " + part);
                restoreState(current);
                termAtt.setEmpty().append(part);
                posIncAtt.setPositionIncrement(0);
            }
        }
        System.err.println("In the end we have " + termAtt.toString());
        // In the end we have xyz@gmail.com
        return true;
    }
}
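For completeness, here is roughly how the filter is wired into my custom Analyzer (a simplified sketch; MyAnalyzer is just a placeholder name and the real chain contains a few more filters):
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;

public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // ClassicTokenizer first, then the SymbolSplitterFilter on top of it
        Tokenizer source = new ClassicTokenizer(reader);
        TokenStream result = new SymbolSplitterFilter(source);
        return new TokenStreamComponents(source, result);
    }
}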
Please note: I started with emails only, which is why I only showed that part of the code, but I will have to enhance it to also handle serial numbers (as explained earlier).
However, the stack is never processed. Indeed, I can't figure out how the incrementToken method works, although I read this SO question, nor when it processes a given token from the TokenStream.
Finally, the goal I want to achieve is: for xyz@gmail.com as input text, I want to generate the following subtokens: xyz@gmail.com, xyz and gmail.com.
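This is how I inspect what actually comes out (using the MyAnalyzer sketch above; the field name is arbitrary). At the moment it prints only xyz@gmail.com:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Analyzer analyzer = new MyAnalyzer();
try (TokenStream ts = analyzer.tokenStream("content", "xyz@gmail.com")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString()); // expected: xyz@gmail.com, xyz, gmail.com
    }
    ts.end();
}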
Any help is appreciated.