
For example: I have billions of short phrases, and I want to cluster the ones that are similar.

strings.to.cluster <- c("Best Toyota dealer in bay area. Drive out with a new car today",
                        "Largest Selection of Furniture. Stock updated everyday",
                        "Unique selection of Handcrafted Jewelry",
                        "Free Shipping for orders above $60. Offer Expires soon",
                        "XXXX is where smart men buy anniversary gifts",
                        "2012 Camrys on Sale. 0% APR for select customers",
                        "Closing Sale on office desks. All Items must go")

Assume this vector has hundreds of thousands of rows. Is there a package in R to cluster these phrases by meaning? Or could someone suggest a way to rank "similar" phrases by meaning relative to a given phrase?

sgt pepper
  • How do you propose to define "meaning"? Which ones of your example phrases should be clustered together? – tripleee Jan 26 '12 at 15:32

2 Answers


You can view your phrases as "bags of words", i.e., build a matrix (a "term-document" matrix), with one row per phrase, one column per word, with 1 if the word occurs in the phrase and 0 otherwise. (You can replace 1 with some weight that would account for phrase length and word frequency). You can then apply any clustering algorithm. The tm package can help you build this matrix.

library(tm)
library(Matrix)
# Term-document matrix: one row per word, one column per phrase
x <- TermDocumentMatrix(Corpus(VectorSource(strings.to.cluster)))
# Convert tm's triplet representation to a sparse Matrix object
y <- sparseMatrix(i = x$i, j = x$j, x = x$v, dimnames = dimnames(x))
# Cluster the phrases (the columns), hence the transpose
plot(hclust(dist(t(y))))
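
If you need actual cluster labels rather than just a dendrogram, base R's `cutree` cuts the tree into k groups. A minimal self-contained sketch of the same bag-of-words idea, building a small binary term-document matrix in base R only (the phrase list, tokenizer, and k = 2 are illustrative assumptions, not part of the answer above):

```r
phrases <- c("best toyota dealer, drive out with a new car",
             "2012 camrys on sale, 0% apr",
             "largest selection of furniture",
             "closing sale on office desks")
# crude tokenizer: lowercase, split on non-alphanumerics
words <- strsplit(tolower(phrases), "[^a-z0-9]+")
vocab <- sort(unique(unlist(words)))
# binary phrase-by-word matrix (the "bag of words")
m <- t(sapply(words, function(w) as.integer(vocab %in% w)))
colnames(m) <- vocab
# "binary" gives a Jaccard-style distance on the 0/1 rows
h <- hclust(dist(m, method = "binary"))
clusters <- cutree(h, k = 2)  # one cluster label per phrase
```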
Vincent Zoonekynd
  • Going off of Vincent's suggestion, there's a dissimilarity stat in the tm package that takes numerous distance arguments, including "pearson". You could use some level of similarity/dissimilarity and select only the sentences that meet the set criteria. – Tyler Rinker Jan 26 '12 at 16:54
  • @TylerRinker, thanks for your question. I am thinking of mostly phrases related in meaning. In my example, "closing sale on office desks.." and "Largest Selection of Furniture..." to be clustered together (along with possibly others) – sgt pepper Jan 27 '12 at 05:07
  • If this approach does not work (you would need, for instance, many sentences with both the "desk" and "furniture" words to automatically identify them as being related), you can either add some knowledge about the meaning of the words (there is a `wordnet` package that knows that a desk is a piece of furniture) or manually tag some of the sentences (put them in different classes, e.g., "cars", "furniture", "travel", "food", etc.) and use them as a training set to automatically tag the rest of the data. – Vincent Zoonekynd Jan 27 '12 at 05:18
  • Similar discussion on SE [link](http://stats.stackexchange.com/questions/7115/semantic-distance-between-excerpts-of-text) but not necessarily in R – Tyler Rinker Jan 28 '12 at 06:42
  • @Vincent, which clustering algorithm did you end up using for this? I have the same exact problem. – tatsuhirosatou Nov 25 '12 at 01:05
  • @climatewarrior: My answer used hierarchical clustering (`hclust`), but you can try other algorithms: they are listed in the [clustering task view](http://cran.r-project.org/web/views/Cluster.html). – Vincent Zoonekynd Nov 27 '12 at 11:16
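
The question also asks how to rank "similar" phrases against a given phrase; with the bag-of-words representation from this answer, that reduces to cosine similarity between rows of the matrix. A base-R sketch (the query phrase and tokenizer are illustrative assumptions):

```r
phrases <- c("largest selection of furniture",
             "closing sale on office desks",
             "free shipping for orders above $60")
query <- "office furniture on sale"
# tokenize the query together with the phrases so they share one vocabulary
docs <- strsplit(tolower(c(query, phrases)), "[^a-z0-9$]+")
vocab <- sort(unique(unlist(docs)))
m <- t(sapply(docs, function(w) as.integer(vocab %in% w)))
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
# similarity of each phrase (rows 2..n) to the query (row 1)
sims <- apply(m[-1, , drop = FALSE], 1, cosine, b = m[1, ])
ranked <- phrases[order(sims, decreasing = TRUE)]  # most similar first
```

With binary weights this favors phrases sharing many words with the query; swapping in tf-idf weights (as the answer suggests) downweights common words.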

Maybe looking at this document could help: http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment. It uses R to analyze market sentiment for airlines using Twitter.

aatrujillob
  • That is an interesting approach, but it appears more suited for classification (e.g., good/bad, +ve/-ve) than for the clustering / meaning-based similarity metric that I am interested in. – sgt pepper Jan 26 '12 at 05:57
  • @sgtpepper Perhaps the package tm could be a good place to start looking. – aatrujillob Jan 26 '12 at 06:11