Techniques for finding common sequences of words in a large text corpus?

Question

Are there any algorithms for efficiently finding the n most common k-word sequences of words in a large corpus of text?

So, for example, if n = 1 and k = 5, and my corpus is:

a b a b a c d e f a b a b a c c b c b a b c

Output would be a b a b a

What is wrong with iterating every k word sentence and add 1 to a counter and then use a heap to get top n? — juvian, Jun 05 '19 at 19:45
Thanks for suggesting this--I assume you mean keeping a hash table of k-word sentences and incrementing a counter each time one is seen, iterating through the whole corpus once. I might use it, but for a big corpus, doing this in RAM would be tough (the hash table would get big). If you mean to iterate through the corpus for each k-word sequence and maintaining a heap of fixed size the whole time, this would solve the memory problem, but now time complexity would be O(n^2) because we'd be iterating through the whole corpus once for each k-word sentence, right? I or maybe i misundertood. — TheSneak, Jun 05 '19 at 20:29
As @juvian pointed out this can be done in O(N+k) time where N is the number of words in the corpus. See https://www.geeksforgeeks.org/find-top-k-or-most-frequent-numbers-in-a-stream/. — RobertBaron, Jun 05 '19 at 21:36
Possible duplicate of [Find the 'n' most repeating words/strings in a huge file that does not fit in memory](https://stackoverflow.com/questions/21565880/find-the-n-most-repeating-words-strings-in-a-huge-file-that-does-not-fit-in-me) — juvian, Jun 05 '19 at 22:18

RobertBaron · Answer 1 · 2019-06-05T21:31:34.227

1

Here is a simple algorithm. Split the sequence of words in sequence of k words, at every word, one by line. For your example, we have:

a b a b a
b a b a c
a b a c d
b a c d e
a c d e f
c d e f a
d e f a b
e f a b a
f a b a b
a b a b a
b a b a c
a b a c c
b a c c b
a c c b c
c c b c b
c b c b a
b c b a b
c b a b c
c b a b c

Sort them:

a b a b a
a b a b a
a b a c c
a b a c d
a c c b c
a c d e f
b a b a c
b a b a c
b a c c b
b a c d e
b c b a b
c b a b c
c b a b c
c b c b a
c c b c b
c d e f a
d e f a b
e f a b a
f a b a b

It is now easy to find the k-sequence that is the most common. If the file is very large, you may have to split it, sort the parts, and then merge the parts.

edited Jun 05 '19 at 21:31

answered Jun 05 '19 at 20:42

RobertBaron

2,817
1
12
19

This is useful when you want to have the count of all words but if you only need top k it's very inefficient – juvian Jun 05 '19 at 21:02
@juvian What do you mean by **top** k? k is given as the length of the sequence. – RobertBaron Jun 05 '19 at 21:06
Sorry top n. If n is small there are better methods – juvian Jun 05 '19 at 21:09

Techniques for finding common sequences of words in a large text corpus?

1 Answers1