
Say I have a data frame with this sample:

structure(list(V1 = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 
    6L, 6L, 6L, 7L, 7L, 7L, 9L, 9L, 9L, 13L, 13L, 13L, 15L, 15L, 
    18L, 22L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 
    26L, 27L, 27L, 28L, 32L, 32L, 32L, 32L, 32L, 36L, 36L, 36L, 36L, 
    36L, 36L, 36L, 37L, 37L, 37L, 37L, 37L, 37L, 38L, 38L, 38L, 38L, 
    38L, 39L, 39L, 39L, 39L, 40L, 40L, 40L, 41L, 41L, 42L, 45L, 45L, 
    45L, 45L, 47L, 47L, 47L, 48L, 50L, 50L, 51L, 53L, 53L, 54L), 
        V2 = c(2L, 7L, 20L, 3L, 5L, 6L, 7L, 13L, 15L, 18L, 19L, 20L, 
        4L, 5L, 6L, 7L, 9L, 12L, 6L, 9L, 12L, 13L, 15L, 18L, 7L, 
        9L, 13L, 15L, 18L, 9L, 20L, 44L, 12L, 27L, 44L, 15L, 18L, 
        58L, 16L, 18L, 19L, 23L, 27L, 28L, 29L, 32L, 45L, 47L, 50L, 
        51L, 52L, 53L, 54L, 55L, 28L, 29L, 29L, 45L, 47L, 53L, 54L, 
        55L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 38L, 39L, 40L, 41L, 
        42L, 43L, 39L, 40L, 41L, 42L, 43L, 40L, 41L, 42L, 43L, 41L, 
        42L, 43L, 42L, 43L, 43L, 47L, 53L, 54L, 55L, 53L, 54L, 55L, 
        49L, 51L, 52L, 52L, 54L, 55L, 55L), N = c(1L, 1L, 1L, 1L, 
        1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 
    -104L), class = c("data.table", "data.frame"))

How should I structure this in order to be able to cluster these using cor(), dist() and hclust()?

For your information, V1 and V2 are material numbers, and N is the frequency of how often they appear together in the same order. I can also change N to a number between 0 and 1 to represent the correlation, if that makes it easier.

To my knowledge I have to change it to a matrix first that looks like this, where V1 would be the rows, V2 the columns, and N the values. But I don't know how.

  1 2 3 4 ...
1 0 1 1 4
2 1 0 2 2
3 1 4 0 1
4 1 0 3 0
...
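
For reference, one way to build such a matrix in base R (a sketch, assuming the data frame is named df; lev and m are helper names used only here, and xtabs() is just one option):

lev <- sort(unique(c(df$V1, df$V2)))   # every material number, so rows and columns line up
m <- as.matrix(xtabs(N ~ factor(V1, lev) + factor(V2, lev), data = df))
m <- m + t(m)                          # each pair is stored only once, so symmetrise
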
  • Based on your example data you want a `55 x 55` matrix with values specified by `N`? – CPak May 24 '18 at 12:41
  • Yes, if that makes it possible to use the clustering functions. – M.A May 24 '18 at 12:44
  • I don't know why people downvote without a comment but please read [(1)](https://stackoverflow.com/help/how-to-ask) how do I ask a good question, [(2)](https://stackoverflow.com/help/mcve) how to create a MCVE as well as [(3)](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#answer-5963610) how to provide a minimal reproducible example in R. I.e., abstract from your real problem... – Christoph May 24 '18 at 13:33

1 Answer


"To my knowledge I have to change it to a matrix first, that look like this, where V1 can be the rows and V2 the column and N the values."

You need a distance matrix for cluster analysis, and your definition isn't a distance matrix.

I think you want materials that appear together to be closer and those that do not appear together to be farther apart, so I propose that the distance between two materials be 1/N when they appear together and 2 when they do not. Then you would have something like this:

> library(data.table)   # dcast() and the .() syntax below come from data.table
> ## df is your data.table; stack each pair in both directions so the table is symmetric,
> ## then spread it to wide format ([, -1] drops the V1 id column)
> dd <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1 ~ V2, value.var = "N")[, -1]
> dd <- 1/dd            # distance 1/N for pairs that appear together
> dd[is.na(dd)] <- 2    # distance 2 for pairs that never appear together
> diag(dd) <- 0         # zero distance from a material to itself
> dd <- as.dist(dd)     # lower-triangle distance object for hclust()/pam()
> dd
     1   2   3   4   5   6   7   9  ...
2  1.0
3  2.0 1.0
4  2.0 2.0 0.5
5  2.0 1.0 1.0 2.0
6  2.0 1.0 1.0 2.0 0.5
7  1.0 0.5 1.0 2.0 2.0 1.0
9  2.0 2.0 0.5 2.0 1.0 1.0 0.5
...
> hc <- hclust(dd)
> plot(hc)

[dendrogram from plot(hc)]

Example with a similarity matrix

> ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1 ~ V2, value.var = "N")[, -1]  # same wide table as above
> ss <- ss/max(ss, na.rm = TRUE)   # scale the counts to [0, 1]
> ss[is.na(ss)] <- 0               # similarity 0 for pairs that never appear together
> diag(ss) <- 1                    # a material is maximally similar to itself
> ss
      1   2   3 4   5   6   7   9  ...
 1: 1.0 0.5 0.0 0 0.0 0.0 0.5 0.0 
 2: 0.5 1.0 0.5 0 0.5 0.5 1.0 0.0 
 3: 0.0 0.5 1.0 1 0.5 0.5 0.5 1.0 
 4: 0.0 0.0 1.0 1 0.0 0.0 0.0 0.0 
 5: 0.0 0.5 0.5 0 1.0 1.0 0.0 0.5 
 6: 0.0 0.5 0.5 0 1.0 1.0 0.5 0.5 
 7: 0.5 1.0 0.5 0 0.0 0.5 1.0 1.0 
 8: 0.0 0.0 1.0 0 0.5 0.5 1.0 1.0 
 9: 0.0 0.0 0.5 0 0.5 0.0 0.0 0.5 
10: 0.0 0.5 0.0 0 0.5 0.5 0.0 0.0  
...
> dd <- as.dist(1 - ss)   # dissimilarity = 1 - similarity
> dd
     1   2   3   4   5   6   7   9  ...
2  0.5
3  1.0 0.5
4  1.0 1.0 0.0
5  1.0 0.5 0.5 1.0
6  1.0 0.5 0.5 1.0 0.0
7  0.5 0.0 0.5 1.0 1.0 0.5
9  1.0 1.0 0.0 1.0 0.5 0.5 0.0
...
> hc2 <- hclust(dd)
> plot(hc2)

[dendrogram from plot(hc2)]

PAM examples:

> # hclust - 5
> cl <- cutree(hc2, 5)
> summary(as.factor(cl))
  1   2   3   4   5 
562   1   1   2   1 
> 
> # pam - 5 with dd
> library(cluster)   # pam() comes from the cluster package
> pam1 <- pam(dd, 5)
> summary(as.factor(pam1$clustering))
  1   2   3   4   5 
402 105  22  21  17 
> 
> # pam - 5 with sqrt(ss): sqrt() boosts the small similarities, so the many
> # "no information" pairs are no longer all at the single maximal distance
> dd2 <- as.dist(1 - sqrt(ss))
> pam2 <- pam(dd2, 5)
> summary(as.factor(pam2$clustering))
  1   2   3   4   5 
362  95  23  61  26 
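
On the question in the comments below about the optimal number of clusters: one common heuristic is to compare the average silhouette width that pam() reports for each candidate k. A minimal sketch, assuming dd and the cluster package from above (the 2:10 range and the names ks/sil are only illustrative):

> # scan candidate cluster counts and keep the one with the best average silhouette
> ks <- 2:10                          # illustrative range, not a recommendation
> sil <- sapply(ks, function(k) pam(dd, k)$silinfo$avg.width)
> ks[which.max(sil)]                  # k with the highest average silhouette width
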
  • Thank you for your help. In the real data frame I have, the highest number of appearances is 489. What if I give them a correlation score, so let's say the pair with 489 appearances is = 1. If another pair has e.g. 200 appearances, the correlation score is 200/489 = 0.4089. Wouldn't you then be able to calculate distances, and then cluster them? – M.A May 24 '18 at 13:55
  • The correlation score matrix that you are defining is not a distance matrix, it is a similarity matrix. You can compute 1 - (your correlation score) to get a dissimilarity matrix, which you can use as a distance matrix. I updated the example. – Juan Antonio Roldán Díaz May 24 '18 at 15:48
  • Hmmm I get the error: Error in .(V1 = V2, V2 = V1, N) : could not find function "." – M.A May 24 '18 at 16:18
  • Have you loaded the data.table package? – Juan Antonio Roldán Díaz May 24 '18 at 16:19
  • It works on my computer; is df a data.table? If it doesn't work, try this: ss <- dcast(rbind(data.frame(V1 = c(df$V1, df$V2), V2 = c(df$V2, df$V1), N = c(df$N, df$N))), V1~V2, value.var = "N")[, -1] – Juan Antonio Roldán Díaz May 24 '18 at 16:23
  • It was because it was not a data.table. Sorry, I'm a beginner in R. It seems to work now with the full data set. I wanted to use the "centroid" method, though it looks really weird. Do you know how I can print the clusters into a file that can be read by Excel? – M.A May 24 '18 at 16:32
  • Yes. With `hclust` you get the dendrogram. To get the cluster of each item, first set how many clusters you want, then extract them with the `cutree` function and save the result to a file you can open in Excel; example with 5 clusters: write.csv(cutree(hc2, 5), file = "myCl.txt"). – Juan Antonio Roldán Díaz May 24 '18 at 16:46
  • Thanks, but it still seems wrong. I tried different clustering methods and different numbers of clusters. All of them show one cluster with almost all materials in it, and other clusters containing only 1 material :/ – M.A May 24 '18 at 17:10
  • I made a pastebin with the full dataset, if that helps you to see whats wrong: https://pastebin.com/MHYYDHEd – M.A May 24 '18 at 17:20
  • I'm sorry M.A., but more than 95% of your pairs have no information; this puts all those materials at the same distance, so they are grouped into a single, very dispersed cluster. The problem can be reduced with the PAM algorithm, but you will still have one very concentrated group. Another option is to increase the weight of the distances other than one. I modified the example. – Juan Antonio Roldán Díaz May 25 '18 at 08:17
  • Thanks Juan. You have been a great help! Using (2 - sqrt(ss)) or (3 - sqrt(ss)) doesn't change anything, if that's what you meant? Also, is there a good way to check the optimal number of clusters when using the pam function? Again, thanks for your help so far! – M.A May 25 '18 at 09:01
  • Also ss is only 567x567, but I have 696 unique material numbers, so it should be 696x696, hmmm... – M.A May 25 '18 at 11:58
  • Do you think you can help, or should I ask it as a new question? – M.A May 25 '18 at 14:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/171833/discussion-between-m-a-and-juan-antonio-roldan-diaz). – M.A May 26 '18 at 14:27