I am trying to create an inverted index for Wikipedia pages, but I keep running out of memory, and I am not sure what else I can do to prevent it. We are, however, talking about 3.9 million words.
indexer.java
public void index() {
    ArrayList<Page> pages = parse(); // Parse XML pages.
    HashMap<String, ArrayList<Integer>> postings = getPostings(pages);
}

public HashMap<String, ArrayList<Integer>> getPostings(ArrayList<Page> pages) {
    assert pages != null;

    englishStemmer stemmer = new englishStemmer();
    HashSet<String> stopWords = getStopWords();
    HashMap<String, ArrayList<Integer>> postings = new HashMap<>();
    int count = 0;
    int artCount = 0;

    for (Page page : pages) {
        if (!page.isRedirect()) { // Skip pages that are redirects.
            StringBuilder sb = new StringBuilder();
            artCount = count; // All the words until now.
            boolean ignore = false;

            for (char c : page.getText().toCharArray()) {
                if (c == '<') // Ignore words inside <> tags.
                    ignore = true;

                if (!ignore) {
                    if (c != 39) { // Skip apostrophes (ASCII 39).
                        if (c > 47 && c < 58 || c > 96 && c < 123) // Character c is a digit 0-9 or a lowercase letter a-z.
                            sb.append(c);
                        else if (c > 64 && c < 91) // Character c is an uppercase letter A-Z.
                            sb.append(Character.toLowerCase(c));
                        else if (sb.length() > 0) { // Check if there is a word up until now.
                            if (sb.length() > 1) { // Ignore single-character "words".
                                if (!stopWords.contains(sb.toString())) { // Check that the word is not a stop word.
                                    stemmer.setCurrent(sb.toString());
                                    stemmer.stem(); // Stem the word.
                                    String s = stemmer.getCurrent(); // Retrieve the stemmed word.
                                    if (!postings.containsKey(s)) // Check if the word already exists in the postings map.
                                        postings.put(s, new ArrayList<>()); // If not, create an array list for that word.
                                    postings.get(s).add(page.getId()); // Add the id of this page to the word's array list.
                                    count++; // Increase the overall word count for the pages.
                                }
                            }
                            sb = new StringBuilder();
                        }
                    }
                }

                if (c == '>')
                    ignore = false;
            }
        }
        page.setCount(count - artCount);
    }
    System.out.println("Word count: " + count);
    return postings;
}
Advantages
Some advantages of this approach are:
- You can get the number of occurrences of a given word simply by getting the size of the associated ArrayList.
- Looking up the number of times a given word occurs in a page is relatively easy (see the sketch below).
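For example, the two lookups might look like this (a rough sketch only, assuming the postings map returned by getPostings() and the same stemmer; the page id 42 is just a placeholder):

englishStemmer stemmer = new englishStemmer();
stemmer.setCurrent("example");
stemmer.stem();
String stem = stemmer.getCurrent(); // Same stemming as in the indexer.

ArrayList<Integer> pageIds = postings.get(stem);
int totalOccurrences = (pageIds == null) ? 0 : pageIds.size(); // First advantage.

int pageId = 42; // Placeholder page id.
int occurrencesInPage = (pageIds == null) ? 0 : Collections.frequency(pageIds, pageId); // Second advantage.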
Optimizations
Current optimizations:
- Ignoring common words (stop words).
- Stemming words to their roots and storing those.
- Ignoring common Wikipedia tags that aren't English words (these are included in the stop word list, e.g. lt, gt, ref, etc.).
- Ignoring text within < > tags, such as <pre> and <div>.
Limitations
Array lists become incredibly large with the number of occurrences of common words, and the major disadvantage of this approach shows up when an array list has to grow: a new, larger backing array is allocated and the items from the previous one are copied across. This could be a possible performance bottleneck. Would a linked list make more sense here, since we are only appending occurrences and not reading them back during indexing? A linked list does not rely on an array as its underlying data structure, so it can grow without bound and never needs to be reallocated and copied when it becomes too large.
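A minimal sketch of that variant (the class and method names here are made up purely for illustration; the indexing loop itself would stay the same, only the list implementation changes):

import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;

// Posting lists backed by LinkedList, so an append never reallocates and
// copies a backing array.
public class LinkedPostings {
    private final HashMap<String, List<Integer>> postings = new HashMap<>();

    // Called once per (stemmed word, page id) occurrence.
    public void add(String stemmedWord, int pageId) {
        postings.computeIfAbsent(stemmedWord, k -> new LinkedList<>()).add(pageId);
    }

    // The total number of occurrences of a word is still just the size of its list.
    public int occurrences(String stemmedWord) {
        List<Integer> ids = postings.get(stemmedWord);
        return ids == null ? 0 : ids.size();
    }
}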
Alternative approaches
I have considered dumping the counts for each word into a database like MongoDB after each page has been processed, and then appending the new occurrences. The documents would be of the form {word : [occurrences]}, and the GC could then clean up the postings HashMap after each page has been processed.
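A rough sketch of that flush step, using the MongoDB Java sync driver (the database, collection, and field names are placeholders, and I haven't measured how this performs):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;

// Appends each word's new occurrences to its {word : [occurrences]} document,
// so the in-memory map for the current page can be discarded afterwards.
public class PostingsFlusher {
    private final MongoCollection<Document> collection;

    public PostingsFlusher(String uri) {
        MongoClient client = MongoClients.create(uri);
        this.collection = client.getDatabase("wiki").getCollection("postings");
    }

    public void flush(HashMap<String, ArrayList<Integer>> pagePostings) {
        for (Map.Entry<String, ArrayList<Integer>> e : pagePostings.entrySet()) {
            collection.updateOne(
                    Filters.eq("word", e.getKey()),
                    Updates.pushEach("occurrences", e.getValue()),
                    new UpdateOptions().upsert(true));
        }
    }
}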
I've also considered moving the pages loop into the index() method, so that the GC can clean up what getPostings() builds before the next page, and then merging the new postings after each page, but I don't think that will alleviate the memory burden.
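Roughly, that restructuring would look like this (assuming a getPostings(Page) overload that indexes a single page; the merge step is only a sketch):

public void index() {
    ArrayList<Page> pages = parse(); // Parse XML pages.
    HashMap<String, ArrayList<Integer>> merged = new HashMap<>();

    for (Page page : pages) {
        // Per-page postings map; becomes garbage once it has been merged.
        HashMap<String, ArrayList<Integer>> pagePostings = getPostings(page);
        for (Map.Entry<String, ArrayList<Integer>> e : pagePostings.entrySet()) {
            merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                  .addAll(e.getValue());
        }
    }
}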
As for the hash maps, would a TreeMap be a better fit for this situation?
Execution
On my machine this program runs on all 4 cores at 90-100% CPU usage and takes about 2-2.5 GB of RAM. It runs for over an hour and a half and then fails with: GC Out of memory.
I have also considered increasing the available memory for this program, but it needs to run on my instructor's machine as well, so it needs to operate as standard without any "hacks".
I need help making considerable optimizations; I'm not sure what else would help.