Rather than trying to keep it all in memory, do this in multiple passes.
First, create two files, one for single words, and one for bigrams.
Now, go through your text sequentially. As you read each word, output it to the single-words file. Combine it with the previous word and write the pair to the bigrams file. For example, given the sentence "the point is that there is no point, making the entire conversation pointless," the single-words file would contain one word per line. The bigrams file would contain:
the point
point is
is that
that there
there is
...
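If it helps, here's a minimal sketch of that first pass. Python is just my choice for illustration; the file names (corpus.txt, words.txt, bigrams.txt) and the crude tokenizer are assumptions, so swap in whatever fits your corpus.

# First pass: stream the corpus, writing one word per line and one bigram per line.
import re

prev = None
with open("corpus.txt", encoding="utf-8") as corpus, \
     open("words.txt", "w", encoding="utf-8") as words, \
     open("bigrams.txt", "w", encoding="utf-8") as bigrams:
    for line in corpus:
        # Naive tokenization: lowercase, keep alphabetic runs only.
        for word in re.findall(r"[a-z]+", line.lower()):
            words.write(word + "\n")
            if prev is not None:
                bigrams.write(prev + " " + word + "\n")
            prev = word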
Now, using the sort utility provided by your operating system, sort each file. That groups identical lines together.
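On a Unix-like system that's just the standard sort command; if you'd rather drive it from the same script, something along these lines works (the sorted-file names are again my own):

# Sort each intermediate file with the OS sort utility (Unix-like systems).
import subprocess

subprocess.run(["sort", "words.txt", "-o", "words_sorted.txt"], check=True)
subprocess.run(["sort", "bigrams.txt", "-o", "bigrams_sorted.txt"], check=True)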
Then, write a program that reads the sorted file line by line, counting runs of identical lines. As you get the total count for each word, write a corresponding file that contains word,count lines. So if you have:
apple
apple
banana
cherry
cherry
cherry
Then your output would be:
apple,2
banana,1
cherry,3
Do the same thing with the bigrams file.
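A rough sketch of that counting pass, relying on the fact that the sort has made identical lines adjacent (file names carried over from the earlier sketches):

# Count runs of identical lines in a sorted file and write "item,count" lines.
def count_sorted(infile, outfile):
    with open(infile, encoding="utf-8") as src, \
         open(outfile, "w", encoding="utf-8") as dst:
        current, count = None, 0
        for line in src:
            item = line.rstrip("\n")
            if item == current:
                count += 1
            else:
                if current is not None:
                    dst.write(f"{current},{count}\n")
                current, count = item, 1
        if current is not None:
            dst.write(f"{current},{count}\n")

count_sorted("words_sorted.txt", "word_counts.txt")
count_sorted("bigrams_sorted.txt", "bigram_counts.txt")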
Finally, load your word counts file into a map or dictionary, indexed by word, with the count as the value. Three million unique words should fit. If not, you could put them into a database. Something like SQLite would work really well.
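If the counts do fit in memory, loading them is a few lines (file name from the sketch above):

# Load word counts into a dict keyed by word.
word_counts = {}
with open("word_counts.txt", encoding="utf-8") as f:
    for line in f:
        word, count = line.rstrip("\n").rsplit(",", 1)
        word_counts[word] = int(count)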
Then start reading your bigram counts file. Each line contains a bigram and its count. You can do the calculation and decide right then whether you want to treat it as a single word, or you can write the bigram with its count and score to a separate file and decide later.
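What score you compute depends entirely on what you're trying to detect. Purely as a hypothetical illustration, reusing the word_counts dict from the previous sketch, you could flag bigrams whose count is a large fraction of the count of their rarer member:

# Hypothetical scoring pass over the bigram counts file.
with open("bigram_counts.txt", encoding="utf-8") as src, \
     open("bigram_scores.txt", "w", encoding="utf-8") as dst:
    for line in src:
        bigram, count = line.rstrip("\n").rsplit(",", 1)
        count = int(count)
        w1, w2 = bigram.split(" ", 1)
        # Score: how often the pair appears relative to its rarer member.
        # 1.0 would mean that word never appears outside this pair.
        score = count / min(word_counts[w1], word_counts[w2])
        dst.write(f"{bigram},{count},{score:.4f}\n")

The ratio above is just a placeholder; a plain frequency cutoff or something like pointwise mutual information would drop into the same spot just as easily.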
You can reduce the size of the intermediate files created in the first pass by keeping partial counts in memory. Rather than writing each word and bigram to the intermediate file immediately, keep two dictionaries in memory and limit their size. When a dictionary fills up, write its words and counts to disk and clear it. That way, rather than having hundreds of thousands of individual "the" lines in the file, you end up with just a handful of "the,100000" entries.
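A sketch of that buffered variant, under the same assumptions as the first sketch (the flush threshold is arbitrary; tune it to your memory budget):

# Buffer counts in memory and flush "item,count" lines once a dict gets big.
import re
from collections import Counter

MAX_ENTRIES = 1_000_000  # arbitrary cap

def flush(counter, path):
    # Append the partial counts to the intermediate file, then start over.
    with open(path, "a", encoding="utf-8") as f:
        for item, count in counter.items():
            f.write(f"{item},{count}\n")
    counter.clear()

word_buf, bigram_buf = Counter(), Counter()
prev = None
with open("corpus.txt", encoding="utf-8") as corpus:
    for line in corpus:
        for word in re.findall(r"[a-z]+", line.lower()):
            word_buf[word] += 1
            if prev is not None:
                bigram_buf[prev + " " + word] += 1
            prev = word
            if len(word_buf) >= MAX_ENTRIES:
                flush(word_buf, "words.txt")
            if len(bigram_buf) >= MAX_ENTRIES:
                flush(bigram_buf, "bigrams.txt")

# Flush whatever is left once the corpus is done.
flush(word_buf, "words.txt")
flush(bigram_buf, "bigrams.txt")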
Reducing the size of the intermediate files also speeds up the sort. In the counting step, when you're collapsing duplicates, you then add up the counts from each entry rather than adding one per line.
Doing this in multiple passes makes things easier because it reduces required memory, and each step is almost trivially simple. Sure, it's not as fast as a single-program solution. But if it's an infrequent thing then who cares if it takes a little extra time?
Another benefit is that this solution is quite scalable. I've done something very similar on my laptop (8 GB of memory), doing word and bigram counts against a download of the entire English Wikipedia. It took a while (a few hours), but worked well.