
I am trying to create an inverted index for Wikipedia pages, but I keep running out of memory. I am not sure what else I can do to ensure it doesn't run out of memory. We are talking about roughly 3.9 million words.

indexer.java

public void index() {
    ArrayList<Page> pages = parse(); // Parse XML pages
    HashMap<String, ArrayList<Integer>> postings = getPostings(pages);
}

public HashMap<String, ArrayList<Integer>> getPostings(ArrayList<Page> pages) {
    assert pages != null;

    englishStemmer stemmer = new englishStemmer();
    HashSet<String> stopWords = getStopWords();
    HashMap<String, ArrayList<Integer>> postings = new HashMap<>();
    int count = 0;
    int artCount = 0;

    for (Page page : pages) {

        if (!page.isRedirect()) { // Skip pages that are redirects.

            StringBuilder sb = new StringBuilder();
            artCount = count; // All the words until now
            boolean ignore = false;

            for (char c : page.getText().toCharArray()) {

                if (c == '<') // Ignore words inside <> tags.
                    ignore = true;

                if (!ignore) {
                    if (c != 39) { // Skip apostrophes (ASCII 39) so e.g. "don't" becomes "dont".
                        if (c > 47 && c < 58 || c > 96 && c < 123) // Character c is a number 0-9 or a lower case letter a-z.
                            sb.append(c);

                        else if (c > 64 && c < 91) // Character c is an uppercase letter A-Z.
                            sb.append(Character.toLowerCase(c));

                        else if (sb.length() > 0) { // Check if there is a word up until now.

                            if (sb.length() > 1) { // Ignore single character "words"

                                if (!stopWords.contains(sb.toString())) { // Check if the word is not a stop word.

                                    stemmer.setCurrent(sb.toString());
                                    stemmer.stem(); // Stem the current token

                                    String s = stemmer.getCurrent(); // Retrieve the stemmed word from the stemmer

                                    if (!postings.containsKey(s)) // Check if the word already exists in the words map.
                                        postings.put(s, new ArrayList<>()); // If the word is not in the map then create an array list for that word.
                                    postings.get(s).add(page.getId()); // Place the id of the page in the word array list.
                                    count++; // Increase the overall word count for the pages
                                }
                            }
                            sb = new StringBuilder();
                        }
                    }
                }

                if (c == '>')
                    ignore = false;
            }
        }
        page.setCount(count - artCount);
    }
    System.out.println("Word count:" + count);
    return postings;
}

Advantages

Some advantages for this approach are:

  • You can get the number of occurrences of a given word simply by getting the size of the associated ArrayList.
  • Looking up the number of times a given word occurs in a page is relatively easy (see the short sketch after this list).
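
For example, both lookups boil down to one call on the word's list. This is a toy sketch, not part of the indexer itself; the map contents and pageId are made up:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;

public class PostingsLookup {
    public static void main(String[] args) {
        // Toy postings map standing in for the one built by getPostings().
        HashMap<String, ArrayList<Integer>> postings = new HashMap<>();
        postings.put("wiki", new ArrayList<>(Arrays.asList(1, 1, 2, 7)));

        String word = "wiki"; // assumed to be already lower-cased and stemmed
        int pageId = 1;       // hypothetical page id

        // Total occurrences of the word across all pages: size of its list.
        int total = postings.get(word).size();

        // Occurrences of the word in one particular page: count of that id.
        int inPage = Collections.frequency(postings.get(word), pageId);

        System.out.println(total + " total, " + inPage + " in page " + pageId);
    }
}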

Optimizations

Current optimizations:

  • Ignoring common words (stop words).
  • Stemming words to their roots and storing those.
  • Ignoring common Wikipedia markup tokens that aren't English words (included in the stop word list, such as: lt, gt, ref, etc.).
  • Ignoring text within < > tags such as: <pre>, <div>

Limitations

The array lists become incredibly large as the number of occurrences per word grows, and the major disadvantage of this approach shows up when an array list has to grow: a new, larger backing array is allocated and the items from the previous array have to be copied into it. This could be a possible performance bottleneck. Would a linked list make more sense here, since we are only appending occurrences and never reading them back during indexing? Linked lists do not rely on an array as their underlying data structure, so they can grow without bound and never need to be copied into a larger array.
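
For comparison, a LinkedList<Integer> avoids the copy when growing, but each element then carries a node object (an object header plus two references) and a boxed Integer, so it would likely cost more memory than the copying saves. A growable int[] wrapper would be lighter; something like this sketch (illustrative only, not what the indexer currently uses):

import java.util.Arrays;

// Minimal growable list of primitive ints: no per-element node objects,
// no Integer boxing, and the same amortized O(1) append as ArrayList.
class IntPostings {
    private int[] ids = new int[4];
    private int size = 0;

    void add(int pageId) {
        if (size == ids.length)
            ids = Arrays.copyOf(ids, ids.length + (ids.length >> 1)); // grow by ~50%
        ids[size++] = pageId;
    }

    int size() { return size; }

    int get(int i) { return ids[i]; }
}

Four bytes per id, instead of a reference plus a roughly 16-byte Integer object, should cut the postings memory several-fold.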

Alternative approaches

I have considered dumping the counts for each word into a database like MongoDB after each page has been processed and then appending the new occurrences. The documents would look like {word : [occurrences]}, and the postings HashMap could then be cleared (and garbage collected) after each page has been processed.
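
A rough sketch of that flush step with the MongoDB Java driver's upsert; the connection string, database, collection, and field names below are placeholders, not settled design:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Flush the in-memory postings for one page into MongoDB, then let the
// map be cleared and garbage collected.
public class PostingsFlush {
    public static void flush(HashMap<String, ArrayList<Integer>> pagePostings) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("wiki").getCollection("postings");
            for (Map.Entry<String, ArrayList<Integer>> e : pagePostings.entrySet()) {
                // Upsert: create {word: ..., occurrences: [...]} if the word is new,
                // otherwise append the new page ids to the existing array.
                coll.updateOne(Filters.eq("word", e.getKey()),
                        Updates.pushEach("occurrences", e.getValue()),
                        new UpdateOptions().upsert(true));
            }
        }
    }
}

In practice one MongoClient would be reused for the whole run and the updates batched with bulkWrite, otherwise the per-word round trips would dominate.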

I've also considered moving the loop over pages into the index() method so that the temporary state of getPostings() can be garbage collected before the next page, then merging the new postings after each page, but I don't think that will alleviate the memory burden.

As for the hash maps, would a TreeMap be a better fit for this situation?

Execution

On my machine this program runs on all 4 cores at 90-100% CPU and uses about 2-2.5 GB of RAM. It runs for over an hour and a half and then dies with a GC-related out-of-memory error.

I have also considered increasing the available memory for the program, but it needs to run on my instructor's machine as well, so it needs to operate as standard without any "hacks".

I need help making considerable optimizations; I'm not sure what else would help.

Warosaurus
  • Side note on code quality: you are doing way too many things within one single method. I wholeheartedly recommend you to study "Clean Code" by R. Martin. Meaning: you spent **a lot** of time writing up this question - it would be very wise to spend the same amount of time when writing code. Besides: unfortunately your question is still a bit "very broad"; you might try to focus on "smaller" aspects within your design; instead of bringing up the whole thing and asking for "general help" improving that. – GhostCat Oct 10 '15 at 11:46
  • You start by giving out the advantages of your approach, but you forgot to explain what your approach was. Do you expect us to try to read your code and fathom out exactly what you are trying to achieve with it? What is the code supposed to achieve, and what are the algorithms and data structures it uses? – RealSkeptic Oct 10 '15 at 11:48
  • Consider simply increasing the heap limit for the JVM with the [-Xmx parameter](http://stackoverflow.com/a/14763095/1362755). Or just representing words with probabilistic data structures (hashes a few bytes in length). You also will have to look into a language processing library because your approach won't work for logographic languages which don't separate words with whitespace (e.g. Chinese). – the8472 Oct 10 '15 at 11:51
  • @Jägermeister Thank you for your comment, you are right! I will have a look at Clean Code, thank you for the recommendation. I actually have Clean Coder on my desk right now :P – Warosaurus Oct 10 '15 at 12:04
  • @RealSkeptic Sorry but you are wrong. I start out by explaining the problem and then give the code base, which I am hoping that any helpful member of this community would have a look over. I then go on to point out specifics of it and ask for both help and opinion. You are right that the approach could be more explanatory, that there could be more description. Thank you for your feedback! – Warosaurus Oct 10 '15 at 12:14
  • Do you have an idea *where* you are after those 90 minutes? – laune Oct 10 '15 at 13:10

2 Answers


TL;DR Most likely your data structure won't fit in memory, no matter what you do.

Side note: you should actually explain what your task is and what your approach is. You don't do that, and instead expect us to read through and puzzle out your code.

What you're basically doing is building a multimap of word -> ids of Wikipedia articles. For this, you parse each non-redirect page, divide it into single words and build a multimap by adding a word -> page id mapping.

Let's roughly estimate how big that structure would be. Your assumption is around 4 million words. There are around 5 million articles in the English Wikipedia. The average word length in English is around 5 characters, so let's assume 10 bytes per word and 4 bytes per article id. We get around 40 MB for the words (keys in the map) and 20 MB for the article ids (values in the map).
Assuming a multimap-like structure backed by a HashMap, you could estimate the HashMap overhead at around 32*size + 4*capacity bytes.

So far this seems to be manageable, a few dozen MBs.

But there will be around 4 million collections to store the ids of articles, each taking around 8*size bytes (if you use array lists), where size is the number of articles the word is encountered in. According to http://www.wordfrequency.info/, the top 5000 words occur in COCA over 300 million times, so I'd expect Wikipedia to be in this range.
That would be around 2.5 GB just for the article ids of the top 5k words (300 million ids × 8 bytes ≈ 2.4 GB). This is a good hint that your inverted index structure will probably take too much memory to fit on a single machine.

However, I don't think the size of the resulting structure is your immediate problem. Your code indicates that you load all pages into memory first and process them later on, and that definitely won't work.

You'll most probably need to process the pages in a stream-like fashion and use some kind of database to store the results. There are basically a thousand ways to do that; I'd personally go with a Hadoop job on AWS with PostgreSQL as the database, utilizing its UPSERT feature.
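
For the streaming part, a rough sketch with StAX (simplified: namespaces and the <redirect/> check are omitted), indexing each <page> as soon as its closing tag is seen instead of building ArrayList<Page> first:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class StreamingIndexer {

    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, true); // deliver text in one event

        try (FileInputStream in = new FileInputStream("enwiki-pages-articles.xml")) {
            XMLStreamReader r = factory.createXMLStreamReader(in);

            int pageId = -1;
            boolean haveId = false;            // only the first <id> inside a <page> is the page id
            StringBuilder text = new StringBuilder();
            String current = "";

            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        current = r.getLocalName();
                        if ("page".equals(current)) {
                            haveId = false;
                            text.setLength(0);
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if ("id".equals(current) && !haveId) {
                            pageId = Integer.parseInt(r.getText().trim());
                            haveId = true;
                        } else if ("text".equals(current)) {
                            text.append(r.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("page".equals(r.getLocalName()))
                            handlePage(pageId, text.toString());
                        current = "";
                        break;
                }
            }
        }
    }

    // Placeholder for the per-page work: tokenize, stem, and flush postings
    // to a database or to sorted spill files instead of keeping them in RAM.
    static void handlePage(int id, String text) {
        System.out.println("page " + id + ": " + text.length() + " chars");
    }
}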

lexicore

ArrayList is a candidate for replacement by a class Index that you'll have to write. It should use an int[] for storing index values and a reallocation strategy with an increment based on the overall growth rate of the word it belongs to. (ArrayList increments by 50% of the old value, and this may not be optimal for rare words.) Also, it should leave room for optimizing the storage of ranges by storing the first index and the negative count of following numbers, e.g.,

..., 100, -3, ...   represents the index values 100, 101, 102, 103

This may result in saving entries for frequently occurring words at the cost of a few cycles.
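
A minimal sketch of such an Index class (the adaptive growth increment is left out; a fixed ~50% growth is used instead):

import java.util.Arrays;

// Postings list backed by int[] with simple run-length encoding:
// a run of consecutive ids 100,101,102,103 is stored as 100,-3
// (first id, negative count of following ids).
class Index {
    private int[] a = new int[4];
    private int n = 0;                 // used slots in a

    void add(int id) {
        if (n >= 2 && a[n - 1] < 0 && a[n - 2] - a[n - 1] + 1 == id) {
            a[n - 1]--;                // id extends the current run
        } else if (n >= 1 && a[n - 1] >= 0 && a[n - 1] + 1 == id) {
            append(-1);                // previous entry plus this id form a run
        } else {
            append(id);                // plain entry (also covers repeated ids)
        }
    }

    private void append(int v) {
        if (n == a.length)
            a = Arrays.copyOf(a, a.length + Math.max(4, a.length >> 1)); // grow ~50%
        a[n++] = v;
    }

    // Number of ids represented, with runs expanded.
    int count() {
        int c = 0;
        for (int i = 0; i < n; i++)
            c += (a[i] < 0) ? -a[i] : 1;
        return c;
    }
}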

Consider dumping the postings HashMap to a file after a certain number of index values have been entered, and continuing with an empty map. If the file is sorted by key, this permits a relatively simple merge of two or more files.
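
A rough sketch of that spill-and-merge; the file format is an arbitrary choice (one word per line as word<TAB>comma-separated ids, sorted by word):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Write the current postings map to a key-sorted spill file, and merge
// two such files into one, concatenating the id lists of equal words.
public class SpillMerge {

    static void dump(HashMap<String, ArrayList<Integer>> postings, String file) throws Exception {
        try (PrintWriter out = new PrintWriter(file)) {
            for (Map.Entry<String, ArrayList<Integer>> e : new TreeMap<>(postings).entrySet()) {
                StringBuilder sb = new StringBuilder(e.getKey()).append('\t');
                for (int i = 0; i < e.getValue().size(); i++) {
                    if (i > 0) sb.append(',');
                    sb.append(e.getValue().get(i));
                }
                out.println(sb);
            }
        }
    }

    static void merge(String fileA, String fileB, String merged) throws Exception {
        try (BufferedReader a = new BufferedReader(new FileReader(fileA));
             BufferedReader b = new BufferedReader(new FileReader(fileB));
             PrintWriter w = new PrintWriter(merged)) {
            String la = a.readLine(), lb = b.readLine();
            while (la != null || lb != null) {
                int cmp = (la == null) ? 1 : (lb == null) ? -1
                        : la.substring(0, la.indexOf('\t'))
                            .compareTo(lb.substring(0, lb.indexOf('\t')));
                if (cmp < 0)      { w.println(la); la = a.readLine(); }
                else if (cmp > 0) { w.println(lb); lb = b.readLine(); }
                else {            // same word: concatenate the id lists
                    w.println(la + ',' + lb.substring(lb.indexOf('\t') + 1));
                    la = a.readLine(); lb = b.readLine();
                }
            }
        }
    }
}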

laune
  • This is a very interesting answer, thank you. I'll definitely have a look into implementing something like this. – Warosaurus Oct 10 '15 at 14:08