76

I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream.

The worst part is that I'm looking at the comments in the JavaDocs that address my question.

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29

Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.

Can anyone explain how to get token-like information from a TokenStream?

Eric Wilson

4 Answers

118

Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:

TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}

Edit: The new way

According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.

TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

tokenStream.reset(); // must be called before the first incrementToken()
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}
tokenStream.end();   // perform end-of-stream operations
tokenStream.close(); // release resources held by the stream
Enno Shioji
Adam Paynter
  • 6
    Now TermAttribute is deprecated. As far as I can see, we can use something like `CharTermAttributeImpl.toString()` instead – Donotello Aug 16 '11 at 09:11
  • 6
    You should use addAttribute rather than getAttribute. From lucene javadocs: "It is recommended to always use addAttribute(java.lang.Class) even in consumers of TokenStreams, because you cannot know if a specific TokenStream really uses a specific Attribute" http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/util/AttributeSource.html#getAttribute(java.lang.Class) – jpountz Apr 11 '12 at 22:29
  • 1
    @jpountz: Thanks for the tip! I have modified the answer accordingly. – Adam Paynter Apr 12 '12 at 08:00
  • 2
    Had to call `reset()` with Lucene 4.3 so took the liberty of adding it – Enno Shioji Aug 26 '13 at 21:47
  • In the end, I don't see an answer to the question as posted: "How to get a **Token** from a Lucene TokenStream?" – serhio Apr 07 '14 at 11:43
  • @serhio: I added a supplementary answer that hopefully addresses your concern – William Price Apr 18 '14 at 15:08
  • You are missing `tokenStream.end()` and `tokenStream.close()` required by the [TokenStream workflow](http://lucene.apache.org/core/4_7_0/core/org/apache/lucene/analysis/TokenStream.html). – Florent Guillaume Apr 23 '14 at 21:22
  • This code skips the first term. How can I print the first term? – user2478236 Apr 12 '17 at 11:32
41

This is how it should be (a clean version of Adam's answer):

TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
  System.out.println(cattr.toString());
}
stream.end();
stream.close();
Enno Shioji
yegor256
  • 10
    Your code did not function properly until I added a stream.reset() before the while loop. I am using Lucene 4.0, so that may be a recent change. Refer to the example near the bottom of this page: http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/package-summary.html –  Jan 09 '13 at 21:29
  • Tried to edit to add the reset() call, which avoids an NPE inside Lucene at incrementToken(), but all but one peer rejected the edit as incorrect. The Lucene docs explicitly say that "The consumer calls reset()" prior to "The consumer calls incrementToken()" in the [TokenStream API](http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html) – William Price Jul 12 '13 at 16:05
  • Also had to call `reset()` with Lucene 4.3 so I took the liberty of adding it – Enno Shioji Aug 26 '13 at 21:46
  • Maybe the question is odd, but in the end it is still not clear how to obtain the next **Token** (not the next string). – serhio Apr 07 '14 at 08:45
3

For Lucene 7.3.1 (the latest version at the time of writing):

    // Test the tokenizer
    Analyzer testAnalyzer = new CJKAnalyzer();
    String testText = "Test Tokenizer";
    TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    try {
        ts.reset(); // Resets this stream to the beginning. (Required)
        while (ts.incrementToken()) {
            // Use AttributeSource.reflectAsString(boolean)
            // for token stream debugging.
            System.out.println("token: " + ts.reflectAsString(true));

            System.out.println("token start offset: " + offsetAtt.startOffset());
            System.out.println("  token end offset: " + offsetAtt.endOffset());
        }
        ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
        ts.close(); // Release resources associated with this stream.
    }

Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html

Flamingo
1

There are two variations in the OP question:

  1. What is "the process to obtain Tokens from a TokenStream"?
  2. "Can anyone explain how to get token-like information from a TokenStream?"

Recent versions of the Lucene documentation for Token say:

NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.

And TokenStream says its API:

... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.

The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way using attributes. Reading through the documentation, the Lucene developers suggest that this change was made, in part, to reduce the number of individual objects created at a time.

But as some people have pointed out in the comments of those answers, they don't directly answer #1: how do you get a Token if you really want/need that type?

With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute just like the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply didn't show it.

It is important to note that while this approach gives you access to a Token while you're looping, it is still only a single object no matter how many logical tokens are in the stream. Every call to incrementToken() changes the state of the Token returned from addAttribute, so if your goal is to build a collection of different Token objects to be used outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy.
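To make the "single reused object" point concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: `MiniStream` and `TermState` are hypothetical stand-ins I made up to model a stream that overwrites one shared attribute object on every `incrementToken()` call. The sketch only illustrates why storing the shared reference goes wrong and why copying inside the loop fixes it.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Plain-Java model of the "one reused attribute object" pattern.
 * MiniStream and TermState are hypothetical stand-ins, NOT Lucene classes.
 */
final class ReusedAttributeDemo {

    /** Mutable holder that is overwritten for every token (plays the role of an attribute). */
    static final class TermState {
        String term;
        int startOffset;
        int endOffset;

        /** Returns an independent copy of the current state. */
        TermState copy() {
            TermState c = new TermState();
            c.term = term;
            c.startOffset = startOffset;
            c.endOffset = endOffset;
            return c;
        }
    }

    /** Whitespace splitter that exposes a single shared TermState instance. */
    static final class MiniStream {
        private final String text;
        private final TermState state = new TermState();
        private int pos = 0;

        MiniStream(String text) {
            this.text = text;
        }

        /** Always returns the same instance, however many tokens the text holds. */
        TermState state() {
            return state;
        }

        /** Advances to the next token and overwrites the shared state in place. */
        boolean incrementToken() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) return false;
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            state.term = text.substring(start, pos);
            state.startOffset = start;
            state.endOffset = pos;
            return true;
        }
    }

    /** Pitfall: keeps the shared reference, so every element ends up showing the last token. */
    static List<String> termsByReference(String text) {
        MiniStream stream = new MiniStream(text);
        TermState shared = stream.state();
        List<TermState> kept = new ArrayList<>();
        while (stream.incrementToken()) kept.add(shared);
        return termsOf(kept);
    }

    /** Fix: copies the state inside the loop, so each element is independent. */
    static List<String> termsBySnapshot(String text) {
        MiniStream stream = new MiniStream(text);
        TermState shared = stream.state();
        List<TermState> kept = new ArrayList<>();
        while (stream.incrementToken()) kept.add(shared.copy());
        return termsOf(kept);
    }

    private static List<String> termsOf(List<TermState> states) {
        List<String> out = new ArrayList<>();
        for (TermState s : states) out.add(s.term);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(termsByReference("quick brown fox")); // [fox, fox, fox]
        System.out.println(termsBySnapshot("quick brown fox"));  // [quick, brown, fox]
    }
}
```

In real Lucene code the same idea would presumably mean copying what you need (for example the result of `charTermAttribute.toString()` and the two offsets) into an object of your own inside the loop, rather than holding on to the attribute instance itself.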

William Price