The Google Play app store recently rolled out a new feature for the Android app, which groups common phrases that reviewers left. I was thinking about which algorithm might accomplish this with as little space and time as possible, but haven't yet come up with a satisfying answer.
The feature looks something like this:
This app has 10.000.000 reviews
"Addictive" : 1234 reviews
- "This is addictive!"
- "Can't put this addictive app down."
- ...
"Bad framerate": 12345 reviews
- "This app has a bad framerate on my device"
- "The main menu has a bad framerate"
- ...
The top 5 phrases are displayed (k=5). A phrase can have 5 words max (l_max=5). A popular app can have 10 million comments with 100 words each (n=10E9 words).
Trivial approaches
Hashmap
A simple algorithm for l_max=1 would be:
- Drop all words into a hashmap, counting their frequency (O(n) time, O(n) space)
- Select k top phrases O(nlogk) via heap, O(k) space
If l_max > 1, step 1 would create a hashmap that needs *O(n/l_max) space, which is too much.
Nussinov
The Nussinov algorithm (see Algorithm to find the most common substrings in a string) would be a great match, but it needs O(n**2) space, so we probably shouldn't use that one here.