I am using the Galago retrieval toolkit (part of the Lemur project) and I need a list of all vocabulary terms in the collection (that is, all unique terms). Specifically, I need a List<String> or a Set<String>.
How can I obtain such a list?
Viewed 243 times
2

John Foley

boomz
1 Answer
1
The `DumpKeysFn` class seems to give all the keys (unique terms) of the collection. The code should look like this:
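import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
// Galago imports (DiskIndex, IndexPartReader, KeyIterator) omitted; package names depend on the Galago version.

public static Set<String> getAllVocabularyTerms(String fileName) throws IOException {
    Set<String> result = new HashSet<>();
    IndexPartReader reader = DiskIndex.openIndexPart(fileName);
    try {
        // An empty index part has no keys, so return the empty set.
        if (reader.getManifest().get("emptyIndexFile", false)) {
            return result;
        }
        // Walk every key (term) stored in this index part.
        KeyIterator iterator = reader.getIterator();
        while (!iterator.isDone()) {
            result.add(iterator.getKeyString());
            iterator.nextKey();
        }
    } finally {
        // Close the reader even if iteration throws.
        reader.close();
    }
    return result;
}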
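
Since the question also mentions a List<String>, the returned set can be copied into a sorted list. This is only a minimal sketch using the standard Java collections: the class name `VocabularyList`, the helper `toSortedList`, and the sample terms are made up for illustration and are not part of Galago.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class VocabularyList {

    /** Copies the terms into a new list and sorts them alphabetically. */
    public static List<String> toSortedList(Set<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up terms for illustration; real input would be the set
        // returned by getAllVocabularyTerms.
        Set<String> terms = new HashSet<>(Arrays.asList("retrieval", "galago", "index"));
        System.out.println(toSortedList(terms)); // prints [galago, index, retrieval]
    }
}
```

Sorting is optional; it just makes the output deterministic, because a HashSet has no defined iteration order.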

boomz
I would just add that, to use this, the file name you are likely to pass is "postings.krovetz" if you want stemmed terms or "postings" if you want unstemmed terms. Typical Java feedback: use a try-with-resources block instead of explicit close calls. – John Foley Nov 19 '15 at 13:10
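
The try-with-resources suggestion in the comment above can be sketched as follows. To keep the example runnable without Galago, it uses a hypothetical stand-in reader (`StubReader`) that is not the Galago API. Applying the same pattern to a real index part reader assumes that type implements AutoCloseable, which is worth checking against your Galago version.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class VocabularySketch {

    /** Hypothetical stand-in for an index part reader; NOT the Galago API. */
    public static class StubReader implements AutoCloseable {
        private final Iterator<String> keys;
        public boolean closed = false;

        public StubReader(List<String> keys) {
            this.keys = keys.iterator();
        }

        public boolean hasNextKey() { return keys.hasNext(); }

        public String nextKey() { return keys.next(); }

        @Override
        public void close() { closed = true; }
    }

    /** Collects every key into a sorted set; the reader is closed on exit. */
    public static Set<String> collectKeys(StubReader reader) {
        Set<String> result = new TreeSet<>();
        // The resource is closed automatically, even if the loop throws.
        try (StubReader r = reader) {
            while (r.hasNextKey()) {
                result.add(r.nextKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StubReader reader = new StubReader(Arrays.asList("retrieval", "index", "index"));
        System.out.println(collectKeys(reader)); // prints [index, retrieval]
        System.out.println(reader.closed);       // prints true
    }
}
```

The benefit over an explicit close call is that the cleanup cannot be skipped by an early return or an exception inside the loop.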