Preserving emails while tokenizing based on . with lucene

Question

I would like to tokenize strings on . , ; and similar delimiters, but preserve email addresses, IP addresses, and the like. How do I use an analyzer with Lucene to do this? The following code, which I found on Stack Overflow, does not preserve emails. Any pointers to documentation on how to use the pattern specification feature of Lucene's StandardAnalyzer would also be helpful. Thanks much.

      // Sample sentence that ends with an email address.
      String text
         = "Lucene is simple yet powerful java based search library. sitaraman@dataguise.com";
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

      TokenStream tokenStream = analyzer.tokenStream(
         LuceneConstants.CONTENTS, new StringReader(text));

      TermAttribute term = tokenStream.addAttribute(TermAttribute.class);

      // Print each token in brackets.
      while (tokenStream.incrementToken()) {
         System.out.print("[" + term.term() + "] ");
      }

score 0 · Answer 1 · answered Jun 24 '16 at 14:42

ClassicAnalyzer, which was the StandardAnalyzer before version 3.1, handles email addresses and IP addresses in the way you are looking for.

It's less refined on text segmentation in general than StandardAnalyzer (especially for non-European languages), but works well for your test case.

answered Jun 24 '16 at 14:42

femtoRgon

@Sitaraman - That seems to introduce ambiguity to me. What's the difference between "gmail.com" and "library.Abarne"? – femtoRgon Jun 24 '16 at 16:41
I am wondering if I can supply an email regular expression that lets Lucene tell the difference between a period in an email address and one in a more general context. – STEMExchanger Jun 24 '16 at 17:34
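
The regular-expression idea in the comment above can be sketched in plain Java, independent of Lucene. This is only an illustration under stated assumptions: the class name EmailAwareTokenizer and its tokenize method are hypothetical, the email and IPv4 patterns are deliberately simplified, and nothing here is Lucene's own API. The pattern tries an email match first, then an IPv4-like match, and only then falls back to plain words, so periods inside addresses survive while other punctuation acts as a delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: a regex-only tokenizer sketch.
public class EmailAwareTokenizer {

    // Alternation order matters: more specific shapes are tried first.
    private static final Pattern TOKEN = Pattern.compile(
        // simplified email: local part, '@', dotted domain
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+"
        // IPv4-like: four dot-separated groups of 1-3 digits (range not checked)
        + "|\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b"
        // fallback: a plain run of letters and digits
        + "|[A-Za-z0-9]+");

    // Returns every match in order; anything unmatched (spaces, '.', ',', ';') is dropped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Lucene is simple yet powerful java based search library. "
            + "sitaraman@dataguise.com";
        // prints [Lucene, is, simple, yet, powerful, java, based, search, library, sitaraman@dataguise.com]
        System.out.println(tokenize(text));
    }
}
```

The ambiguity raised in the earlier comment still applies: if the space after the period is missing, as in "library.sitaraman@dataguise.com", this pattern treats the whole string as one address, because a regular expression alone cannot tell where the sentence ends and the local part begins.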
