1

Are there any algorithms for efficiently finding the n most common k-word sequences of words in a large corpus of text?

So, for example, if n = 1 and k = 5, and my corpus is:

a b a b a c d e f a b a b a c c b c b a b c

Output would be a b a b a

TheSneak
  • 500
  • 6
  • 15
  • 1
    What is wrong with iterating every k word sentence and add 1 to a counter and then use a heap to get top n? – juvian Jun 05 '19 at 19:45
  • Thanks for suggesting this--I assume you mean keeping a hash table of k-word sentences and incrementing a counter each time one is seen, iterating through the whole corpus once. I might use it, but for a big corpus, doing this in RAM would be tough (the hash table would get big). If you mean to iterate through the corpus for each k-word sequence and maintaining a heap of fixed size the whole time, this would solve the memory problem, but now time complexity would be O(n^2) because we'd be iterating through the whole corpus once for each k-word sentence, right? I or maybe i misundertood. – TheSneak Jun 05 '19 at 20:29
  • As @juvian pointed out this can be done in O(N+k) time where N is the number of words in the corpus. See https://www.geeksforgeeks.org/find-top-k-or-most-frequent-numbers-in-a-stream/. – RobertBaron Jun 05 '19 at 21:36
  • 1
    Possible duplicate of [Find the 'n' most repeating words/strings in a huge file that does not fit in memory](https://stackoverflow.com/questions/21565880/find-the-n-most-repeating-words-strings-in-a-huge-file-that-does-not-fit-in-me) – juvian Jun 05 '19 at 22:18

1 Answers1

1

Here is a simple algorithm. Split the sequence of words in sequence of k words, at every word, one by line. For your example, we have:

a b a b a
b a b a c
a b a c d
b a c d e
a c d e f
c d e f a
d e f a b
e f a b a
f a b a b
a b a b a
b a b a c
a b a c c
b a c c b
a c c b c
c c b c b
c b c b a
b c b a b
c b a b c
c b a b c

Sort them:

a b a b a
a b a b a
a b a c c
a b a c d
a c c b c
a c d e f
b a b a c
b a b a c
b a c c b
b a c d e
b c b a b
c b a b c
c b a b c
c b c b a
c c b c b
c d e f a
d e f a b
e f a b a
f a b a b

It is now easy to find the k-sequence that is the most common. If the file is very large, you may have to split it, sort the parts, and then merge the parts.

RobertBaron
  • 2,817
  • 1
  • 12
  • 19