
I am making a crude Java spellchecker that takes an article and a pre-sorted dictionary file. Since the lengths of the article's words vary, I tried making a stack that takes in the words given by the file.

This unfortunately didn't work because the stack ran out of space (even with the shortened dictionary file), so for performance reasons I decided to read from the text file directly.

The issue is that the words in the file are not all the same length. Because the word lengths vary, I cannot (and should not) expect to work out how many words the dictionary file contains just from how large the file is.

Because of this I am stuck. I need to run a binary search on that file to make the spellcheck program work, but I can't run a binary search if there is no clear way to treat the file as an array, especially when the file is too large to fit into the program's memory.

What should I do?

JoseAyeras
  • Load the file into memory. Add more memory to Java, if needed. – Andreas Sep 27 '18 at 20:15
  • This SO post explains how to increase JVM memory: https://stackoverflow.com/questions/1565388/increase-heap-size-in-java Load your dictionary file into a Set implementation, check each word in the document being spell checked with existence in the Set, if not in the Set, spelling is wrong (this is much faster than binary search) – StvnBrkdll Sep 27 '18 at 20:30
  • Staying with your idea, I would suggest writing that file using fixed-width records, the size of the longest word in your file. Wastes some space, but binary search over it is straightforward... – moilejter Sep 27 '18 at 21:32
  • I'm thinking of closing this question (if I even can) because in hindsight it might be too broad. Thank you for giving some direction for solving this problem. – JoseAyeras Sep 28 '18 at 01:07
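moilejter's fixed-width idea from the comments could be sketched like this. The record length, padding scheme, and file name here are assumptions, not part of the question: every record is padded to the same byte width, so record number N starts at byte N * RECORD_LEN and the file can be binary-searched like an array without loading it into memory.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedWidthSearch {
    // Assumed format: each record in "dict_fixed.txt" is RECORD_LEN bytes,
    // i.e. the word padded with spaces, ending with '\n'. Records are sorted.
    static final int RECORD_LEN = 32;

    static boolean contains(RandomAccessFile file, String word) throws IOException {
        long lo = 0;
        long hi = file.length() / RECORD_LEN - 1;
        byte[] buf = new byte[RECORD_LEN];
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD_LEN);      // jump straight to record #mid
            file.readFully(buf);
            String candidate = new String(buf, "US-ASCII").trim();
            int cmp = candidate.compareTo(word);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile("dict_fixed.txt", "r")) {
            System.out.println(contains(f, "hello"));
        }
    }
}
```

This wastes some space, as the comment notes, but only one record is ever in memory at a time.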

1 Answer


The Oxford English Dictionary suggests that there are about 250,000 words you need to consider for your dictionary (not counting words used only in highly specific domains). That is an important design constraint for you.

I see some solutions:

1) Simply using a HashSet<>

In theory, you can use a HashSet<> for this number of elements (this SO post discusses the theoretical limits of HashSets and others in detail).

However, this brings (as you have observed) a couple of problems:

  • It takes time (on every application start up) to read this into the RAM

  • It eats up RAM

Of course you can increase the heap size of your JVM, but there is a natural limit to that (@StvnBrkdll linked an SO post in the comments that describes this perfectly)

2) Using a Database

I'd consider storing the valid words in a (relational) database:

  • You don't need to load everything on application start up

  • It does not weigh as heavily on your RAM as option (1)

  • It gives you more options if you later want your application to also suggest similar words (without typos) to the user (e.g. with PostgreSQL you could achieve this with the pg_trgm extension)

It has some drawbacks though:

  • You mentioned your application is simple: Having a database system adds complexity
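A rough sketch of the lookup with plain JDBC. The table name "dictionary", the column "word", and the use of SQLite (with the sqlite-jdbc driver on the classpath) are assumptions for illustration; any JDBC database works the same way:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DbSpellCheck {
    // Assumes a table "dictionary(word TEXT PRIMARY KEY)".
    public static boolean isKnownWord(Connection conn, String word) throws SQLException {
        // With a primary key on "word", the database performs the lookup via
        // its on-disk index, so the dictionary never has to fit in the JVM heap.
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT 1 FROM dictionary WHERE word = ?")) {
            ps.setString(1, word.toLowerCase());
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // Demo with an in-memory database; a real setup would load the
        // dictionary file into the table once, ahead of time.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE dictionary (word TEXT PRIMARY KEY)");
            st.execute("INSERT INTO dictionary VALUES ('hello')");
            System.out.println(isKnownWord(conn, "hello"));
            System.out.println(isKnownWord(conn, "helo"));
        }
    }
}
```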
Maximilian C.