R: clustering documents

Question

I've got a DocumentTermMatrix that looks as follows:

       artikel naam product personeel loon verlof
 doc 1       1    1       2         1    0      0
 doc 2       1    1       1         0    0      0
 doc 3       0    0       1         1    2      1
 doc 4       0    0       0         1    1      1

In the package tm, it's possible to calculate the Hamming distance between two documents. Now I want to cluster all the documents that have a Hamming distance smaller than 3 to each other. Here, cluster 1 should contain documents 1 and 2, and cluster 2 should contain documents 3 and 4. Is there a way to do that?

This question will be easier to answer and more useful to others if you include a reproducible example. See https://stackoverflow.com/help/how-to-ask and http://stackoverflow.com/q/5963269/134830 — Richie Cotton, Oct 27 '14 at 10:10

Karolis Koncevičius · Accepted Answer · 2014-10-27T11:52:51.463

I saved your table to myData:

myData
     artikel naam product personeel loon verlof
doc1       1    1       2         1    0      0
doc2       1    1       1         0    0      0
doc3       0    0       1         1    2      1
doc4       0    0       0         1    1      1
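
For reproducibility, the table can be rebuilt as a numeric matrix roughly as follows. This is a sketch that assumes the values printed above. A matrix is used rather than a data frame because hamming.distance() is documented as taking a vector or matrix. If you start from a tm DocumentTermMatrix, as.matrix() on it should give the same kind of object.

```r
# Rebuild the table from the question as a numeric matrix
# (values copied from the printout above).
myData <- matrix(
  c(1, 1, 2, 1, 0, 0,
    1, 1, 1, 0, 0, 0,
    0, 0, 1, 1, 2, 1,
    0, 0, 0, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("doc", 1:4),
    c("artikel", "naam", "product", "personeel", "loon", "verlof")
  )
)
myData
```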

Then I used the hamming.distance() function from the e1071 package. You can use your own distances, as long as they are in matrix form.

library(e1071)
distMat <- hamming.distance(myData)
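
If e1071 is not available, the same kind of matrix can be sketched in base R. The helper name hammingRows is hypothetical, and the sketch assumes the Hamming distance between two rows is the number of positions in which they differ, which is how the e1071 documentation describes it.

```r
# Data from the question, rebuilt as a numeric matrix.
myData <- matrix(
  c(1, 1, 2, 1, 0, 0,
    1, 1, 1, 0, 0, 0,
    0, 0, 1, 1, 2, 1,
    0, 0, 0, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("doc", 1:4),
    c("artikel", "naam", "product", "personeel", "loon", "verlof")
  )
)

# Hypothetical helper: pairwise Hamming distance between the rows of m,
# counted as the number of columns in which two rows differ.
hammingRows <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(m[i, ] != m[j, ])
    }
  }
  d
}

distMat <- hammingRows(myData)
distMat
#      doc1 doc2 doc3 doc4
# doc1    0    2    5    5
# doc2    2    0    5    6
# doc3    5    5    0    2
# doc4    5    6    2    0
```

Documents 1 and 2 are at distance 2, as are documents 3 and 4, while every cross pair is at distance 5 or 6.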

Next, run hierarchical clustering with the "complete" linkage method, so that the maximum distance within one cluster can be specified later.

dendrogram <- hclust(as.dist(distMat), method="complete")

Select groups according to the maximum distance between points in a group (here the maximum is 5):

groups <- cutree(dendrogram, h=5)
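
The question asks for distances smaller than 3. With complete linkage, the height of a merge equals the largest pairwise distance inside the merged cluster, so any cut height from 2 up to just below 3 should enforce that limit. The sketch below uses h = 2.5 (an assumed value, chosen only to sit between the integer distances 2 and 3) and computes the distances in base R so that it runs on its own.

```r
# Data from the question, rebuilt as a numeric matrix.
myData <- matrix(
  c(1, 1, 2, 1, 0, 0,
    1, 1, 1, 0, 0, 0,
    0, 0, 1, 1, 2, 1,
    0, 0, 0, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("doc", 1:4),
    c("artikel", "naam", "product", "personeel", "loon", "verlof")
  )
)

# Pairwise Hamming distances (number of differing columns per row pair).
n <- nrow(myData)
distMat <- outer(
  seq_len(n), seq_len(n),
  Vectorize(function(i, j) sum(myData[i, ] != myData[j, ]))
)
dimnames(distMat) <- list(rownames(myData), rownames(myData))

# Complete linkage, then cut below 3 instead of at 5.
dendrogram <- hclust(as.dist(distMat), method = "complete")
groups <- cutree(dendrogram, h = 2.5)
groups
# doc1 doc2 doc3 doc4
#    1    1    2    2

# Check: largest pairwise distance inside each cluster.
maxWithin <- sapply(
  split(seq_along(groups), groups),
  function(idx) max(distMat[idx, idx])
)
maxWithin
# 1 2
# 2 2
```

For this data the cut at 2.5 gives the same two groups as the cut at 5, because the final merge only happens at height 6.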

Finally plot the results:

plot(dendrogram)  # main plot
abline(h = 5, col = "red", lty = 2)  # add cutting line
rect.hclust(dendrogram, h = 5, border = seq_along(unique(groups)) + 1)  # draw rectangles

[Figure: hclust dendrogram with the cutting line and cluster rectangles]

Another way to see the cluster membership for each document is with table:

table(groups, rownames(myData))

groups doc1 doc2 doc3 doc4
     1    1    1    0    0
     2    0    0    1    1
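
If a list of document names per cluster is more convenient than a contingency table, split() can be used. In this sketch, groups is a stand-in vector typed in by hand to match the membership shown above. In practice it would be the result of cutree().

```r
# Stand-in for the cutree() result shown above (assumed values).
groups <- c(doc1 = 1L, doc2 = 1L, doc3 = 2L, doc4 = 2L)

# One character vector of document names per cluster.
members <- split(names(groups), groups)
members
# $`1`
# [1] "doc1" "doc2"
#
# $`2`
# [1] "doc3" "doc4"
```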

So documents 1 and 2 fall into one group, while documents 3 and 4 fall into another.
