
I am using the Galago retrieval toolkit (part of the Lemur project) and I need a list of all vocabulary terms in the collection (all unique terms). Specifically, I need a List<String> or Set<String>. How can I obtain such a list?

John Foley
boomz

1 Answer


The `DumpKeysFn` class seems to give all the keys (unique terms) of the collection. Based on it, the code should look like this:

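import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
// Galago imports (package paths may differ between Galago versions)
import org.lemurproject.galago.core.index.IndexPartReader;
import org.lemurproject.galago.core.index.KeyIterator;
import org.lemurproject.galago.core.index.disk.DiskIndex;

public static Set<String> getAllVocabularyTerms(String fileName) throws IOException {
    Set<String> result = new HashSet<>();
    IndexPartReader reader = DiskIndex.openIndexPart(fileName);
    try {
        if (reader.getManifest().get("emptyIndexFile", false)) {
            // An empty index part holds no keys, so return the empty set.
            return result;
        }
        // Walk every key (term) stored in this index part.
        KeyIterator iterator = reader.getIterator();
        while (!iterator.isDone()) {
            result.add(iterator.getKeyString());
            iterator.nextKey();
        }
    } finally {
        // Close the reader even if iteration throws.
        reader.close();
    }
    return result;
}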
boomz
    I would just add that, to use this, the filename you're likely to pass is "postings.krovetz" if you want stemmed terms or "postings" if you want unstemmed terms. Typical Java feedback: use a try-with-resources block instead of explicit close calls. – John Foley Nov 19 '15 at 13:10
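
A minimal sketch of the try-with-resources pattern mentioned in the comment. It assumes the reader type implements `AutoCloseable`, which is worth checking for your Galago version. So that the example runs without Galago, `VocabularySketch` and `FakeKeyReader` are hypothetical stand-ins that only mimic the `isDone()` / `getKeyString()` / `nextKey()` loop from the answer; they are not part of Galago's API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class VocabularySketch {

    // Hypothetical stand-in for an index part reader; NOT part of Galago.
    static class FakeKeyReader implements AutoCloseable {
        private final List<String> keys;
        private int position = 0;
        boolean closed = false;

        FakeKeyReader(String... keys) {
            this.keys = Arrays.asList(keys);
        }

        boolean isDone() { return position >= keys.size(); }
        String getKeyString() { return keys.get(position); }
        void nextKey() { position++; }

        @Override
        public void close() { closed = true; }
    }

    // Collects all unique keys. The reader is closed when the try block
    // exits, whether normally or because of an exception.
    static Set<String> collectTerms(FakeKeyReader source) {
        Set<String> result = new TreeSet<>();
        try (FakeKeyReader reader = source) {
            while (!reader.isDone()) {
                result.add(reader.getKeyString());
                reader.nextKey();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FakeKeyReader reader = new FakeKeyReader("apple", "banana", "apple");
        System.out.println(collectTerms(reader)); // prints [apple, banana]
        System.out.println(reader.closed);        // prints true
    }
}
```

With the real toolkit, the same shape would wrap the `DiskIndex.openIndexPart(fileName)` call in the `try (...)` header, provided the returned reader is `AutoCloseable`; otherwise a `try`/`finally` with an explicit `close()` gives the same guarantee.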