I'm completely new to text extraction. While searching for an example, I found one implemented with Lucene. I tried to run it in Eclipse, but it fails with this error: (TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow). I took the code directly from an article published on the web and made a few modifications, because I first wanted to make sure the code runs without errors before working through its parts one by one. The original code fetched its text from a URL, but I changed it to read from a predefined String (it's in the main method). I also changed the version constant, since I'm using Lucene 4.8.

I also searched for the error and tried a few changes, but I'm still getting it. Could you please help me get rid of this error? What should I modify to avoid it? I got the original code from http://pastebin.com/jNALz7DZ and here is my modified version:

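import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class KeywordsGuesser {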

     /** Lucene version. */
     private static Version LUCENE_VERSION = Version.LUCENE_48;

     /**
      * Keyword holder, composed by a unique stem, its frequency, and a set of found corresponding
      * terms for this stem.
      */
    public static class Keyword implements Comparable<Keyword> {

         /** The unique stem. */
         private String stem;

         /** The frequency of the stem. */
         private Integer frequency;

         /** The found corresponding terms for this stem. */
        private Set<String> terms;

         /**
          * Unique constructor.
          * 
          * @param stem The unique stem this instance must hold.
          */
         public Keyword(String stem) {
             this.stem = stem;
            terms = new HashSet<String>();
             frequency = 0;
         }

         /**
          * Add a found corresponding term for this stem. If this term has been already found, it
          * won't be duplicated but the stem frequency will still be incremented.
          * 
          * @param term The term to add.
          */
         private void add(String term) {
             terms.add(term);
             frequency++;
         }

         /**
          * Gets the unique stem of this instance.
          * 
          * @return The unique stem.
          */
         public String getStem() {
             return stem;
         }

         /**
          * Gets the frequency of this stem.
          * 
          * @return The frequency.
          */
         public Integer getFrequency() {
             return frequency;
         }

         /**
          * Gets the list of found corresponding terms for this stem.
          * 
          * @return The list of found corresponding terms.
          */
        public Set<String> getTerms() {
             return terms;
         }

         /**
          * Used to reverse sort a list of keywords based on their frequency (from the most frequent
          * keyword to the least frequent one).
          */
         @Override
         public int compareTo(Keyword o) {
             return o.frequency.compareTo(frequency);
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public boolean equals(Object obj) {
             return obj instanceof Keyword && obj.hashCode() == hashCode();
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public int hashCode() {
             return Arrays.hashCode(new Object[] { stem });
         }

         /**
          * User-readable representation of a keyword: "[stem] x[frequency]".
          */
         @Override
         public String toString() {
             return stem + " x" + frequency;
         }

     }

     /**
      * Stemmize the given term.
      * 
      * @param term The term to stem.
      * @return The stem of the given term.
      * @throws IOException If an I/O error occurred.
      */
     private static String stemmize(String term) throws IOException {

         // tokenize term
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(term));
         // stemmize
         tokenStream = new PorterStemFilter(tokenStream);

         Set<String> stems = new HashSet<String>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
         // for each token
         while (tokenStream.incrementToken()) {
             // add it in the dedicated set (to keep unicity)
             stems.add(token.toString());
         }

         // if no stem or 2+ stems have been found, return null
         if (stems.size() != 1) {
             return null;
         }

         String stem = stems.iterator().next();

         // if the stem has non-alphanumerical chars, return null
         if (!stem.matches("[\\w-]+")) {
             return null;
         }

         return stem;
     }

     /**
      * Tries to find the given example within the given collection. If it hasn't been found, the
      * example is automatically added in the collection and is then returned.
      * 
      * @param collection The collection to search into.
      * @param example The example to search.
      * @return The existing element if it has been found, the given example otherwise.
      */
     private static <T> T find(Collection<T> collection, T example) {
         for (T element : collection) {
             if (element.equals(example)) {
                 return element;
             }
         }
         collection.add(example);
         return example;
     }

     /**
      * Extracts text content from the given URL and guesses keywords within it (needs jsoup parser).
      * 
      * @param url The URL to read.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occurred.
      * @see <a href="http://jsoup.org/">http://jsoup.org/</a>
      */
     public static List<Keyword> guessFromUrl(String url) throws IOException {
         // get textual content from url
         //Document doc = Jsoup.connect(url).get();
         //String content = doc.body().text();

       String content = url;
         // guess keywords from this content
         return guessFromString(content);
     }

     /**
      * Guesses keywords from given input string.
      * 
      * @param input The input string.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occurred.
      */
     public static List<Keyword> guessFromString(String input) throws IOException {

         // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
         input = input.replaceAll("-+", "-0");
         // replace any punctuation char except dashes and apostrophes with a space
         input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
         // replace most common English contractions
         input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

         // tokenize input
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(input));
         // to lower case
         tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
         // remove dots from acronyms (and "'s" but already done manually above)
         tokenStream = new ClassicFilter(tokenStream);
         // convert any char to ASCII
         tokenStream = new ASCIIFoldingFilter(tokenStream);
         // remove english stop words
         tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());

         List<Keyword> keywords = new LinkedList<Keyword>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

         // for each token
         while (tokenStream.incrementToken()) {
             String term = token.toString();
             // stemmize
             String stem = stemmize(term);
             if (stem != null) {
                 // create the keyword or get the existing one if any
                 Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                 // add its corresponding initial token
                 keyword.add(term.replaceAll("-0", "-"));
             }
         }



         tokenStream.end();
         tokenStream.close();


         // reverse sort by frequency
         Collections.sort(keywords);

         return keywords;
     }



     public static void main(String args[]) throws IOException{

       String input = "Java is a computer programming language that is concurrent, "
               + "class-based, object-oriented, and specifically designed to have as few "
               + "implementation dependencies as possible. It is intended to let application developers "
               + "write once, run anywhere (WORA), "
               + "meaning that code that runs on one platform does not need to be recompiled "
               + "to run on another. Java applications are typically compiled to byte code (class file) "
               + "that can run on any Java virtual machine (JVM) regardless of computer architecture. "
               + "Java is, as of 2014, one of the most popular programming languages in use, particularly "
               + "for client-server web applications, with a reported 9 million developers."
               + "[10][11] Java was originally developed by James Gosling at Sun Microsystems "
               + "(which has since merged into Oracle Corporation) and released in 1995 as a core "
               + "component of Sun Microsystems' Java platform. The language derives much of its syntax "
               + "from C and C++, but it has fewer low-level facilities than either of them."
               + "The original and reference implementation Java compilers, virtual machines, and "
               + "class libraries were developed by Sun from 1991 and first released in 1995. As of "
               + "May 2007, in compliance with the specifications of the Java Community Process, "
               + "Sun relicensed most of its Java technologies under the GNU General Public License. "
               + "Others have also developed alternative implementations of these Sun technologies, "
               + "such as the GNU Compiler for Java (byte code compiler), GNU Classpath "
               + "(standard libraries), and IcedTea-Web (browser plugin for applets).";

       System.out.println(KeywordsGuesser.guessFromString(input));
     }



 }

This is the error output by Eclipse:

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
    at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.zzRefill(ClassicTokenizerImpl.java:431)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.getNextToken(ClassicTokenizerImpl.java:638)
    at org.apache.lucene.analysis.standard.ClassicTokenizer.incrementToken(ClassicTokenizer.java:140)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.standard.ClassicFilter.incrementToken(ClassicFilter.java:47)
    at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:104)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
    at beehex.lucene.KeywordsGuesser.guessFromString(KeywordsGuesser.java:239)
    at beehex.lucene.KeywordsGuesser.main(KeywordsGuesser.java:288)

After getting rid of the error, my output is:

[java x10, develop x5, sun x5, run x4, compil x4, languag x3, implement x3, applic x3, code x3, gnu x3, comput x2, program x2, specif x2, have x2, on x2, platform x2, byte x2, class x2, virtual x2, machin x2, most x2, origin x2, microsystem x2, ha x2, releas x2, 1995 x2, it x2, from x2, c x2, librari x2, technolog x2, concurr x1, class-bas x1, object-ori x1, design x1, few x1, depend x1, possibl x1, intend x1, let x1, write x1, onc x1, anywher x1, wora x1, mean x1, doe x1, need x1, recompil x1, anoth x1, typic x1, file x1, can x1, ani x1, jvm x1, regardless x1, architectur x1, 2014 x1, popular x1, us x1, particularli x1, client-serv x1, web x1, report x1, 9 x1, million x1, 10 x1, 11 x1, jame x1, gosl x1, which x1, sinc x1, merg x1, oracl x1, corpor x1, core x1, compon x1, deriv x1, much x1, syntax x1, fewer x1, low-level x1, facil x1, than x1, either x1, them x1, refer x1, were x1, 1991 x1, first x1, mai x1, 2007 x1, complianc x1, commun x1, process x1, relicens x1, under x1, gener x1, public x1, licens x1, other x1, also x1, altern x1, classpath x1, standard x1, icedtea-web x1, browser x1, plugin x1, applet x1]

Abhineet Verma

1 Answer

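You need to reset the TokenStream before the first call to incrementToken() on it, as the error message points out. The loop in stemmize() also consumes a TokenStream without calling reset(), so it needs the same line:

// add this line
tokenStream.reset();
while (tokenStream.incrementToken()) {
    // ... existing loop body ...
}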
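
The call order that the error message refers to can be sketched without Lucene on the classpath. The class below is a toy stand-in, not Lucene's TokenStream: ToyTokenStream, term(), consume() and ConsumeWorkflowDemo are made-up names for illustration, and only the method names reset(), incrementToken(), end() and close() mirror the real API. It fails in a similar way when incrementToken() is called before reset(), and consume() shows the order in which such a stream is meant to be used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy stand-in (NOT a Lucene class) that enforces reset() before consumption. */
class ToyTokenStream implements AutoCloseable {
    private final List<String> source;
    private Iterator<String> cursor; // stays null until reset() is called
    private String current;

    ToyTokenStream(String text) {
        // Splitting on whitespace is an assumption of this sketch, not Lucene's tokenization.
        this.source = Arrays.asList(text.trim().split("\\s+"));
    }

    /** Must be called once before the first incrementToken(). */
    void reset() {
        cursor = source.iterator();
    }

    /** Advances to the next token; throws if reset() was never called. */
    boolean incrementToken() {
        if (cursor == null) {
            throw new IllegalStateException("contract violation: reset() call missing");
        }
        if (!cursor.hasNext()) {
            return false;
        }
        current = cursor.next();
        return true;
    }

    /** Text of the current token. */
    String term() {
        return current;
    }

    /** End-of-stream hook; nothing to do in this toy version. */
    void end() {
    }

    @Override
    public void close() {
        cursor = null;
    }
}

class ConsumeWorkflowDemo {
    /** Consumes the stream in the order reset, incrementToken loop, end, close. */
    static List<String> consume(ToyTokenStream stream) {
        List<String> terms = new ArrayList<>();
        try (ToyTokenStream ts = stream) { // close() runs automatically at the end
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(ts.term());
            }
            ts.end();
        }
        return terms;
    }

    /** Returns true if skipping reset() triggers the IllegalStateException. */
    static boolean throwsWithoutReset() {
        try {
            new ToyTokenStream("java").incrementToken();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(consume(new ToyTokenStream("write once run anywhere")));
        System.out.println("fails without reset(): " + throwsWithoutReset());
    }
}
```

In the question's code, the same order would apply to both guessFromString and stemmize: reset() before the loop, then end() and close() after it.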
omu_negru
  • Thank you very much. I'm getting the output now. I also referred to this Stack Overflow post: http://stackoverflow.com/questions/17447045/java-library-for-keywords-extraction-from-input-text Is there any way to display the output as shown in that post? I changed my main method to System.out.println(KeywordsGuesser.guessFromString(input)); – user3774248 Jun 25 '14 at 09:20
  • I would suggest taking a look at the Lucene demo module ( http://lucene.apache.org/core/4_8_0/demo/overview-summary.html ) if you want to learn more about Lucene, or grabbing an information retrieval book if you want to learn about inverted indexes, stemming, etc. – omu_negru Jun 25 '14 at 09:34
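
On the display question in the first comment, one possible approach is to loop over the returned list and print each keyword on its own line instead of printing the whole list at once. The sketch below assumes the wanted format is the stem, its frequency, and the terms found for it; that format is a guess, not taken from the linked post. KeywordRow and KeywordPrinter are hypothetical names; KeywordRow only imitates the getStem(), getFrequency() and getTerms() getters of the Keyword class in the question so that the example runs on its own, and the sample data is illustrative.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in exposing the same getters as KeywordsGuesser.Keyword. */
class KeywordRow {
    private final String stem;
    private final int frequency;
    private final Set<String> terms;

    KeywordRow(String stem, int frequency, String... terms) {
        this.stem = stem;
        this.frequency = frequency;
        // LinkedHashSet keeps insertion order so the printed terms are stable.
        this.terms = new LinkedHashSet<>(Arrays.asList(terms));
    }

    String getStem() {
        return stem;
    }

    int getFrequency() {
        return frequency;
    }

    Set<String> getTerms() {
        return terms;
    }
}

class KeywordPrinter {
    /** Formats one keyword as "stem x<frequency> [term1, term2]". */
    static String format(KeywordRow keyword) {
        return keyword.getStem() + " x" + keyword.getFrequency() + " " + keyword.getTerms();
    }

    /** Prints one keyword per line, in list order. */
    static void print(List<KeywordRow> keywords) {
        for (KeywordRow keyword : keywords) {
            System.out.println(format(keyword));
        }
    }

    public static void main(String[] args) {
        // Illustrative sample data, not actual program output.
        print(Arrays.asList(
                new KeywordRow("develop", 5, "developers", "developed"),
                new KeywordRow("run", 4, "run", "runs")));
    }
}
```

With the Keyword class from the question, the same loop would call the real getters; since its terms are stored in a HashSet, their order within the brackets is not guaranteed.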