
Looking for some advice on generating a list of commonly used words and phrases from a bunch of entries in a NoSQL database. Basically, we have a bunch of posts made by someone and we want to tell them, "Hey there. You use these words / phrases a lot."

I'm a bit stumped on this one.

My application uses Ruby on Rails, Backbone.js, and Redis.

stueynet
  • Identifying sequences of letters is simple (which are not exactly words), but if you want to extract real words and phrases, you need to do natural language processing and data mining. It is not that easy. – sawa May 27 '13 at 22:02
  • I would integrate a word stemming library like the [ruby-stemmer gem](https://github.com/aurelian/ruby-stemmer) for the languages you need to support (just English?). I haven't used the library. You probably want to do the word count calculation offline per user and store the results in a cache; for example, you could use Redis to store a hash of the words and counts. – Andrew Atkinson May 28 '13 at 00:43

1 Answer


Since it's not clear how the posts are stored, I'll just assume you can get an array of all the posts.

A simple algorithm to find the most common uncommon words (that is, the most frequent words once everyday words are excluded) would be the following: Iterate over the array of posts, strip everything but the words from each post, and split it into words. For each word, add 1 to the number of times you've seen it. Once that's done for all the words in all your entries, you'll have a hash mapping each word to its number of occurrences. Remove the most common words; here's an example of 100 common words, though you should probably use more in your application. Sort the remaining words by number of occurrences and you'll have the most commonly occurring words.

It's implemented here. It doesn't treat variants such as "posts" and "post" as the same word, which you might want. You could look into how Rails implements String#singularize to get this behavior.
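For reference, the counting steps described above could look roughly like the sketch below. It is not the linked implementation: the `common_words` helper, the tiny `STOP_WORDS` list, and the sample posts are all made up for illustration, and a real application would want a much longer stop-word list.

```ruby
require "set"

# Tiny illustrative stop-word list (an assumption, not a real corpus);
# use a much larger list in practice.
STOP_WORDS = Set.new(%w[the a an and or of to in is it i you that this for on with my we])

# Hypothetical helper: returns [[word, count], ...] sorted by descending count,
# with ties broken alphabetically.
def common_words(posts, limit = 10)
  counts = Hash.new(0)
  posts.each do |post|
    # Lowercase the post, then pull out runs of letters (allowing one inner apostrophe).
    post.downcase.scan(/[a-z]+(?:'[a-z]+)?/).each do |word|
      counts[word] += 1 unless STOP_WORDS.include?(word)
    end
  end
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

posts = [
  "I love Ruby and I love Redis.",
  "Ruby on Rails is the framework we use.",
  "Redis stores the counts; Ruby does the counting."
]
p common_words(posts, 3)
```

Like the approach described above, this does no stemming or singularizing, so "count", "counts", and "counting" are tallied separately.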

If you want to find commonly used phrases, it gets more interesting. You'd probably have to use some kind of natural language processing, as @sawa pointed out in a comment. I can't come up with a solution that is fast enough off the top of my head.
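Short of real NLP, one crude approximation is to count repeated n-grams, meaning runs of n consecutive words. The sketch below is only that: `common_phrases` is a hypothetical helper, it ignores sentence boundaries and grammar, and it makes no claim about being fast enough for a large number of posts.

```ruby
# Hypothetical helper: counts every run of n consecutive words within each post
# and returns the phrases seen more than once, most frequent first.
def common_phrases(posts, n = 2, limit = 10)
  counts = Hash.new(0)
  posts.each do |post|
    words = post.downcase.scan(/[a-z]+(?:'[a-z]+)?/)
    # each_cons yields every window of n adjacent words (none if the post is shorter).
    words.each_cons(n) { |gram| counts[gram.join(" ")] += 1 }
  end
  counts.select { |_phrase, count| count > 1 }
        .sort_by { |phrase, count| [-count, phrase] }
        .first(limit)
end

posts = [
  "At the end of the day it works.",
  "At the end of the day we ship."
]
p common_phrases(posts, 3, 2)
```

Because n-grams overlap, a single repeated phrase shows up as several entries ("at the end", "the end of", and so on), so the output would still need filtering before showing it to a user.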

Sirupsen