
I have a large amount of text that includes Wikipedia articles, news articles, etc.: about 1.5 billion words total, and about 3 million unique words.

What I want to do is decide when to count two consecutive words as a single word; for example, "orange juice" should probably be treated as a single word. To decide whether a pair of words should be treated as a single word, I need to know how many times the bigram occurs and how many times each word in the bigram occurs, and then check `bigramCount / (word1Count * word2Count) > threshold`. The problem is that a variable containing all the bigram counts of my text would occupy more memory than my computer's RAM.
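
To make the test concrete, here is a minimal sketch in Python (the function name and threshold value are placeholders; it assumes both counts are already in dictionaries, which is exactly what doesn't fit in memory for the bigrams):

    # Threshold test for promoting a bigram to a single "word".
    # word_counts:   dict mapping word -> occurrence count
    # bigram_counts: dict mapping (word1, word2) -> occurrence count
    def is_single_word(word1, word2, word_counts, bigram_counts, threshold):
        bigram = bigram_counts.get((word1, word2), 0)
        score = bigram / (word_counts[word1] * word_counts[word2])
        return score > threshold

    # e.g. is_single_word("orange", "juice", word_counts, bigram_counts, 1e-5)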

What I tried doing is:

1. Count single words
2. For every single word:
    1. Count every occurrence of a bigram that starts with that word
    2. Decide, applying the formula, which of those bigrams should be treated as a single word.

That way it's easier on the memory, but it takes too long. It has been running for at least a day now, so I'm trying to come up with a better way of doing this.
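
Schematically, the approach looks something like this (a simplified sketch rather than my actual code; it assumes the text is already a flat list of tokens):

    from collections import Counter

    def count_words(tokens):
        # Step 1: unigram counts in a single pass.
        return Counter(tokens)

    def bigrams_starting_with(tokens, first_word):
        # Step 2.1: count bigrams that start with one given word.
        # Doing this once per unique word means re-scanning the corpus
        # millions of times, which is why it is so slow.
        counts = Counter()
        for w1, w2 in zip(tokens, tokens[1:]):
            if w1 == first_word:
                counts[w2] += 1
        return counts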

Any ideas?

Cristian Desivo
  • This is a case of divide and conquer. Split the set into manageable chunks, run your algorithm over the chunks, and aggregate the data. Repeat the initial runs using the aggregated totals, if necessary. –  Jan 10 '19 at 00:29
  • @Strom If you mean count occurrences in every chunk and then aggregate them, that doesn't solve the memory problem because the final result would be the same size. If you mean count occurrences in every chunk and then iterate for each word over every chunk, I tried that, but it's really slow too. – Cristian Desivo Jan 10 '19 at 00:54

2 Answers


Break your data into evenly sized 100-200 MiB chunks. Run your algorithm on the first chunk. Store the top 85% of bigram candidates (the most frequently encountered combinations), comma separated, in a file (1.csv). Sort the file by the first word.

Repeat for the remaining chunks into new files (2, 3, 4, ...) until there is no more data.

Correlate (combine the counts for matching bigrams) files 1 and 2 into a new CSV file 1a. Correlate files 3 and 4 into a new CSV file 2a. Repeat for the rest of the files. If there is an odd number of files, correlate the last file with a random file (1..n). Then correlate the 1a, 2a, ... files.

Continue until there is a single file with your results.
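
To illustrate, a Python sketch of one correlate step, assuming each file holds `bigram,count` rows already sorted by the bigram text (the exact file layout here is an assumption, not a requirement):

    import csv

    def correlate(path_a, path_b, path_out):
        # Merge two count files that are already sorted by bigram,
        # summing the counts when the same bigram appears in both.
        with open(path_a, newline="") as fa, open(path_b, newline="") as fb, \
             open(path_out, "w", newline="") as fo:
            reader_a, reader_b = csv.reader(fa), csv.reader(fb)
            writer = csv.writer(fo)
            row_a, row_b = next(reader_a, None), next(reader_b, None)
            while row_a is not None or row_b is not None:
                if row_b is None or (row_a is not None and row_a[0] < row_b[0]):
                    writer.writerow(row_a)
                    row_a = next(reader_a, None)
                elif row_a is None or row_b[0] < row_a[0]:
                    writer.writerow(row_b)
                    row_b = next(reader_b, None)
                else:  # same bigram in both files: sum the counts
                    writer.writerow([row_a[0], int(row_a[1]) + int(row_b[1])])
                    row_a, row_b = next(reader_a, None), next(reader_b, None)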

This procedure is a binary-tree solution. It is optimal in terms of run time, but it introduces a spatial bias: pairs that occur more often close together, or evenly spaced throughout all samples, will carry a higher weight in the final result.

The most complete solution is to aggregate the full expansion of all the levels. For example, correlate 1 and 3 => 1b, 1 and 4 => 1c, ..., 2 and 1 => 2b, 2 and 3 => 2c, 2 and 4 => 2d, ..., and then in the next step combine 1a and 1b, ..., 2a and 2b, ... This is an exponential solution (slow).

To balance performance against complexity and to reduce the bias, you can randomize the pairings at the lower levels:

For example, randomize the order of the chunks as they are analyzed at each level, and make sure the algorithm outputs each pair only once.

If you randomize the selections at the bottom of the tree multiple times (roughly half of the full expansion described above), while eliminating duplicate pairs from all previous iterations, the accuracy in the layers above improves greatly.

You can repeat this randomization for the second and third levels if a full analysis is still not possible; beyond the third level, significant gains are unlikely, due to the law of diminishing returns.

I would recommend using a pre-built bigram database, or at least restricting the bigram candidates, at the top level, to (noun|adjective, noun). Otherwise you may just get the most-used noun/verb combinations (in most modern American English data sets, that would be "I am" or "I have").

  • I will keep this in mind, but I would rather not introduce the extra bias if possible. – Cristian Desivo Jan 10 '19 at 20:58
  • The exponential solution introduces no bias other than that present within the data set. It just takes the same complexity as the original problem + log n time for aggregation and storage. –  Jan 12 '19 at 20:45

Rather than trying to keep it all in memory, do this in multiple passes.

First, create two files, one for single words, and one for bigrams.

Now, go through your text sequentially. As you read each word, output it to the single-words file. Combine it with the previous word and write the pair to the bigrams file. For example, given the sentence "the point is that there is no point, making the entire conversation pointless," the single-word file would contain one word per line. The bigrams file would contain:

the point
point is
is that
that there
there is
...
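
A minimal sketch of this first pass (it assumes the corpus can be streamed line by line and uses naive whitespace tokenization; note that it also pairs words across line breaks, which you may or may not want):

    def first_pass(corpus_path, words_path, bigrams_path):
        with open(corpus_path) as corpus, \
             open(words_path, "w") as words_file, \
             open(bigrams_path, "w") as bigrams_file:
            prev = None
            for line in corpus:
                for word in line.split():          # naive tokenizer
                    words_file.write(word + "\n")  # one word per line
                    if prev is not None:
                        bigrams_file.write(prev + " " + word + "\n")
                    prev = word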

Now, using the sort utility provided by your operating system, sort each file. That groups identical words together.

Then, write a program that reads the file line-by-line, counting identical lines. As you get the total count of each word, write a corresponding file that contains word,count. So if you have:

apple
apple
banana
cherry
cherry
cherry

Then your output would be:

apple,2
banana,1
cherry,3

Do the same thing with the bigrams file.
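
Because the files are sorted, identical lines are adjacent, so the counting program is a single streaming pass. A sketch, writing the `item,count` format shown above:

    import itertools

    def count_sorted_lines(sorted_path, counts_path):
        with open(sorted_path) as fin, open(counts_path, "w") as fout:
            # groupby relies on identical lines being adjacent,
            # which the sort step guarantees.
            for line, group in itertools.groupby(fin):
                fout.write(line.rstrip("\n") + "," + str(sum(1 for _ in group)) + "\n")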

Finally, load your single-words file into a map or dictionary, indexed by word with the value being the count. Three million unique words should fit. If not, you could put them into a database. Something like SQLite would work really well.

Then start reading your bigrams file. Each line contains the bigram and its count. You can do the calculation and make the decision then whether you want to treat it as a single word, or you can output the bigram with its count and score to a separate file, and make the decision later.
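
A sketch of that final pass, assuming the `item,count` files produced in the previous step (the threshold and output format are whatever you choose):

    def score_bigrams(word_counts_path, bigram_counts_path, out_path, threshold):
        # Load the ~3 million unigram counts into a dict.
        word_counts = {}
        with open(word_counts_path) as f:
            for line in f:
                word, count = line.rsplit(",", 1)
                word_counts[word] = int(count)

        # Stream the bigram counts and apply the formula from the question.
        with open(bigram_counts_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                bigram, count = line.rsplit(",", 1)
                word1, word2 = bigram.split()
                score = int(count) / (word_counts[word1] * word_counts[word2])
                if score > threshold:
                    fout.write(bigram + "," + count.strip() + "," + str(score) + "\n")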

You can reduce the size of the intermediate files created in the first pass by keeping some stuff in memory. Rather than writing each word and bigram to the intermediate file immediately, keep two dictionaries in memory, and limit their size. When a dictionary fills up, write the words and counts to the disk, and clear the dictionary. That way, rather than having hundreds of thousands of individual "the" words in the file, you end up with just a handful of "the,100000" entries.
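
For example, something like this for the single-word side; the size limit is arbitrary, and the same idea applies to the bigram buffer:

    from collections import Counter

    def buffered_word_pass(corpus_path, words_path, max_entries=1_000_000):
        counts = Counter()
        with open(corpus_path) as corpus, open(words_path, "w") as out:
            for line in corpus:
                for word in line.split():
                    counts[word] += 1
                    if len(counts) >= max_entries:
                        # Flush partial counts and start over.
                        for w, n in counts.items():
                            out.write(w + "," + str(n) + "\n")
                        counts.clear()
            for w, n in counts.items():  # flush whatever is left
                out.write(w + "," + str(n) + "\n")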

Reducing the size of the intermediate files will also speed up the sort. In the second step, when you're collapsing duplicates, you add the stored counts for each entry rather than adding one per line.

Doing this in multiple passes makes things easier because it reduces required memory, and each step is almost trivially simple. Sure, it's not as fast as a single-program solution. But if it's an infrequent thing then who cares if it takes a little extra time?

Another benefit is that this solution is quite scalable. I've done something very similar on my laptop (8 GB memory), doing word and bigram counts against a download of the entire English Wikipedia. It took a while (a few hours), but it worked well.

Jim Mischel
  • This could work, but what do you mean by "Now, using the sort utility provided by your operating system, sort each file"? Would this work in a reasonable time with the bigrams text file, which would be about 14 GB in size? Do you know how to do that in Linux? – Cristian Desivo Jan 10 '19 at 20:55
  • The Linux command would be `sort outputfile.txt`. You might want to look up options, but I think that'll work just fine. I don't know how long it would take. 14 GB isn't all that large. You could try it with some smaller files and extrapolate. From what I recall of the Linux sort, it's reasonably performant on simple string sorting. – Jim Mischel Jan 10 '19 at 21:52
  • @CristianDesivo You might be interested in https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort. Also, do a Google search for [linux sort performance]. – Jim Mischel Jan 10 '19 at 21:53
  • `sort -S 50% --parallel=4 file.txt >output.txt` Change `4` to however many CPU cores you have. – Jim Mischel Jan 10 '19 at 21:59
  • Having tried this solution, I have to say that it worked better than I could have expected; this is probably the best approach to do this. – Cristian Desivo Jan 12 '19 at 03:14
  • @CristianDesivo I don't know if it's the *best*, but it's a good one. It's the same general technique pioneered 50+ years ago for batch processing of transactions, and extended and popularized by MapReduce and Hadoop. It's easy to understand, simple to implement, and scales really well. – Jim Mischel Jan 12 '19 at 03:25