
Problem

[Here follows a description of what the app should do under which constraints]

I want a data structure that can check whether a string exists in a 250,000-word list, while using only a modest amount of RAM and keeping the time it takes to load this data structure into RAM small (let's say 0-8 seconds). The time it takes to find a word should also be short (let's say 0 to 0.5 seconds), but RAM usage is more important. It should also be possible to create multiple games (more on what this game is about under the heading "Use") without needing significantly more memory.

It would also be highly valuable to know which words start with a given string, but not so valuable that it is worth adding several seconds to the load time.


Use

It is for an offline Android game. Limited RAM is available. The maximum amount of RAM an application can use, according to this post, is between 16 and 32 MB depending on the device. My empty Android application already uses about 17 MB (using Memory Monitor in Android Studio). My Android device caps RAM usage at 26 MB, leaving me about 8 MB of free space for my whole Activity.


Options I tried

They all seem doomed in different ways.

  1. Hashmap - Read all words into a HashMap object.

    1.1 initialize speed: slow; reading each word into the HashMap takes about 23 seconds.

    1.2 ram usage: significant, although I forgot exactly how much.

    1.3 search speed: checking whether a word exists in the list is quick, of course.

    1.4 narrowing down on possible words (optional): slow, since it needs to go through the whole HashMap and delete entries one by one. Because it relies on deletion, multiple games cannot share the same instance of the HashMap, and duplicating it per game would take too much memory, therefore making narrowing down on the possible words impossible.

  2. Trie - Implement a RadixTree; you can see my implementation here.

    2.1 initialize speed: slow; reading each word into the RadixTree takes about 47 seconds.

    2.2 ram usage: so much that Android is suspending threads a couple of times.

    2.3 search speed: checking whether a word exists in the list is quick.

    2.4 narrowing down on possible words (optional): ultra fast, since only a reference to a node in the tree is needed to then find all possible words as its children. You can play a lot of games with narrowing down the possible words, since an extra game requires only a reference to a node in the tree!

  3. Scanner - Go through the word-file sequentially

    3.1 initialize speed: none.

    3.2 ram usage: none.

    3.3 search speed: about 20 seconds.

    3.4 narrowing down on possible words (optional): can't be done realistically.

simple code:

// wordFile is a java.util.Scanner over the word-list file,
// e.g. new Scanner(new File("wordlist.txt"))
String word;
String wordToFind = "example";
boolean foundWord = false;

while (wordFile.hasNextLine()) {
    word = wordFile.nextLine();
    if (word.equals(wordToFind)) {
        foundWord = true;
        break;
    }
}

wordFile.close();
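For comparison with option 1 above, here is a minimal sketch of loading the list into a HashSet with a BufferedReader instead of a Scanner; much of a slow load is often Scanner's per-token parsing rather than the hash table inserts. (The class name, Reader parameter, and pre-sizing value are my own assumptions, not from the original measurements.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

public class WordSetLoader {
    // Reads one word per line into a HashSet. BufferedReader skips the
    // per-token parsing overhead that Scanner adds on every call.
    public static Set<String> load(Reader source) throws IOException {
        // Pre-size so 250,000 inserts never trigger a rehash
        // (assumes the default load factor of 0.75).
        Set<String> words = new HashSet<>(400_000);
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    words.add(line);
                }
            }
        }
        return words;
    }
}
```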

Options I thought of:

  1. Long-binary-search-tree: Converting the word-list to a sorted array of longs, then reading these in and doing a binary search on them.

    1.1 initialize speed: probably about the same as the hash-map, or a little less, at around 20 seconds. However, I hope calling Arrays.sort() does not take too much time; no idea as of yet.

    1.2 ram usage: if you only allow words of 12 letters or fewer over a 26-letter alphabet, you need 5 bits (2^5 = 32) to encode each letter, so a word fits in 12 × 5 = 60 bits, i.e. a single 64-bit long. An array of 250,000 longs then needs 250,000 × 8 bytes ≈ 2 MB, which is not too much.

    1.3 search speed: fast, using Arrays.binarySearch() (O(log n)).

    1.4 narrowing down on possible words (optional): Narrowing down on possible words might be possible, but I am not sure how, according to a comment on this post.

  2. Hashmap with storage - Creating a hash function that maps a word to an index in the word-list file, then seeking to that specific location in the file and scanning from there to see whether the word exists. Since the word-list is in natural order, you can use the alphabetical ordering to tell when you can stop looking.

    2.1 initialize speed: none at runtime (the file has to be prepared with every word at the right index beforehand).

    2.2 ram usage: none.

    2.3 search speed: fast.

    2.4 narrowing down on possible words (optional): not possible.
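The long-encoding idea in option 1 above can be sketched as follows (the class and method names are my own, not from the post). Because letters are packed most-significant first with zero padding, the numeric order of the longs matches alphabetical order, so the sorted array could even support the optional prefix query (1.4) as a binary-searched range scan:

```java
import java.util.Arrays;

public class PackedWordList {
    // Packs a lowercase a-z word of up to 12 letters into a long:
    // 5 bits per letter ('a' -> 1 ... 'z' -> 26, 0 pads the tail).
    // 12 * 5 = 60 bits, which fits in a 64-bit long.
    public static long encode(String word) {
        if (word.length() > 12) throw new IllegalArgumentException("word too long");
        long packed = 0L;
        for (int i = 0; i < 12; i++) {
            int v = i < word.length() ? word.charAt(i) - 'a' + 1 : 0;
            packed = (packed << 5) | v;
        }
        return packed;
    }

    private final long[] sorted;

    public PackedWordList(String[] words) {
        sorted = new long[words.length];
        for (int i = 0; i < words.length; i++) sorted[i] = encode(words[i]);
        Arrays.sort(sorted); // one-time cost; sorting 250k longs takes milliseconds
    }

    public boolean contains(String word) {
        return Arrays.binarySearch(sorted, encode(word)) >= 0;
    }
}
```

Multiple games can share one instance read-only, since lookups never mutate the array.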


Specific questions I have

  1. Are the options in the "Options I thought of" section viable, or are there things I missed that would make them impossible to implement?
  2. Are there options I have not thought of which are better/equal in performance?

End remarks

I have been stuck on this for about a week now, so any new ideas are more than welcome. If any of my assumptions above are incorrect, I would also be pleased to hear about them.

I made this post this way so others could learn from it as well, either by seeing my mistakes or by seeing what does work in the answers.

Joop

  • Why not just use a SQLite database? "The maximum amount of ram an Application can use according to this post is between 16-32mb ram depending on the device" -- you may wish to consider reading the accepted answer on that question. "My empty Android Application already uses about 17mb (using Memory Monitor in Android Studio)" -- AFAIK, that is not reporting your heap size, and so you are comparing apples and oranges. – CommonsWare Apr 28 '15 at 11:56
  • In the beginning of my journey I searched if using SQLite was a good idea. I then got back the answer that it was overkill and would not be as fast as a simpler data-structure could provide. Is this not true? I did read the answer to that post of course, but I have little experience on heap sizes and how I can or can't use them. I will look into it. – Joop Apr 28 '15 at 12:04
  • "would not be as fast as a simpler data-structure could provide" -- you do not need it to be fast. You need it to be fast enough. According to you, 500ms is fast enough, and SQLite should have no problem with that, given that you put an index on the column with the word. "Is this not true?" -- you would have been done with your work by now. – CommonsWare Apr 28 '15 at 12:12
  • I do get your point. The 500ms mark indicates the point in which I consider my program working, not working smoothly or good. I wanted to find a good solution. I will try to implement a SQLite database and add it to the options I tried once done. I first also tried to avoid it since the optional feature [finding the words that are still possible given a string] would be inefficient for SQLite as well, but that feature became optional only because I thought it would be too much to ask while keeping the amount of ram low. – Joop Apr 28 '15 at 12:25
  • "I first also tried to avoid it since the optional feature [finding the words that are still possible given a string] would be inefficient for SQLite as well" -- it may not be too bad if you use [FTS3 or FTS4 indexing and prefix queries](http://www.sqlite.org/fts3.html#section_3). – CommonsWare Apr 28 '15 at 12:31
  • I will try that and see how that performs, I will eventually update this post to post the performance the same way I did here with the other datastructures (so not very good but it's a start). – Joop Apr 28 '15 at 12:40

2 Answers


This sounds like an ideal use for a bloom filter. If you're willing to allow the risk of something being falsely considered a word, you can condense your wordlist into an amount of memory as small or as large as you're willing to make it.
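A minimal Bloom filter sketch in Java, assuming the common double-hashing trick to derive k bit indices from two base hashes; the class name and hash mixing here are illustrative, not from the answer. A Bloom filter can report a false positive, but never a false negative:

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    // sizeInBits and numHashes are tuned from the desired false-positive
    // rate, e.g. with a calculator like the one linked in the comments.
    public BloomFilter(int sizeInBits, int numHashes) {
        this.size = sizeInBits;
        this.hashes = numHashes;
        this.bits = new BitSet(sizeInBits);
    }

    // Derives the i-th index as h1 + i*h2 (double hashing); a cheap
    // second hash is good enough for a sketch.
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String s) {
        for (int i = 0; i < hashes; i++) bits.set(index(s, i));
    }

    public boolean mightContain(String s) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(s, i))) return false; // definitely absent
        }
        return true; // present, or a false positive
    }
}
```

With 1.43 MB ≈ 12 million bits for 250,000 words, the calculator linked below gives a roughly 1-in-10-billion false-positive rate.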

jsheeran
  • Thanks for the suggestion but I do not want something to be falsely identified as a word :( – Joop Apr 28 '15 at 15:29
  • Have a look at [this calculator](http://hur.st/bloomfilter?n=250000&p=1.0E-10). In this example, using 1.43MB of memory, you have a 1 in 10 billion chance of a false positive. And if it turns out that people are exploiting the fact that "XQFMJB" is a word, you can change the filters periodically and have the program download a new floppy-disk-sized dictionary. – jsheeran Apr 28 '15 at 15:44
  • Oh that sounds very promising. I will check this out as well! – Joop Apr 28 '15 at 22:07

I had this same issue and ended up going with an "on-disk" trie. That is, I encode the data structure into a single file using byte offsets instead of pointers (packing the nodes in reverse order, with the "root" node being the last written).

It is fast to load by simply reading the file into a byte array, with trie traversal using offset values the same way it would use pointers.

My 200K word set fits in 1.7 MB (uncompressed) with a 4 byte value in each word terminating node.
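As a sketch of the layout this answer describes: each node is encoded as a terminal flag, a child count, and letter/offset pairs, with children written before their parents (so the root comes last) and the root's offset appended at the end. The exact byte format and all names here are my own assumptions for illustration, not Darrell's actual encoding:

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

public class FlatTrie {
    // Build-time node; discarded once the byte array is written.
    private static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean terminal;
        int offset; // position of this node in the encoded array
    }

    // Per-node layout: 1 byte terminal flag, 1 byte child count, then per
    // child: 1 byte letter (a-z assumed), 4 bytes big-endian child offset.
    public static byte[] encode(String[] words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray())
                n = n.children.computeIfAbsent(c, k -> new Node());
            n.terminal = true;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(root, out);
        writeInt(out, root.offset); // root offset stored at the very end
        return out.toByteArray();
    }

    private static void write(Node n, ByteArrayOutputStream out) {
        for (Node child : n.children.values()) write(child, out); // children first
        n.offset = out.size();
        out.write(n.terminal ? 1 : 0);
        out.write(n.children.size());
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            out.write(e.getKey());
            writeInt(out, e.getValue().offset);
        }
    }

    private static void writeInt(ByteArrayOutputStream out, int v) {
        out.write(v >>> 24); out.write(v >>> 16); out.write(v >>> 8); out.write(v);
    }

    private static int readInt(byte[] d, int p) {
        return ((d[p] & 0xFF) << 24) | ((d[p + 1] & 0xFF) << 16)
             | ((d[p + 2] & 0xFF) << 8) | (d[p + 3] & 0xFF);
    }

    // Lookup walks offsets in the byte array the same way a pointer-based
    // trie walks node references; no per-node objects are allocated.
    public static boolean contains(byte[] data, String word) {
        int node = readInt(data, data.length - 4);
        for (int i = 0; i < word.length(); i++) {
            int count = data[node + 1] & 0xFF;
            int next = -1;
            for (int c = 0; c < count; c++) {
                int entry = node + 2 + c * 5;
                if ((char) (data[entry] & 0xFF) == word.charAt(i)) {
                    next = readInt(data, entry + 1);
                    break;
                }
            }
            if (next < 0) return false;
            node = next;
        }
        return data[node] == 1;
    }
}
```

The byte array can be written to disk once and loaded with a single read; narrowing down on possible words still works, since a game only needs to hold a node offset instead of a node reference.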

Darrell