I have been running Louvain community detection in R using igraph, with thanks to this answer for my previous query. However, I found that the cluster_louvain method seemed to do something strange when assigning group membership, which I think was due to an error in how I imported my data. Whilst I think I have resolved this, I would like to understand what the problem was.
I ran Louvain clustering on a 400x400 correlation matrix (i.e. correlation scores for 400 individuals). When I initially imported my data, my correlation matrix had the same individuals’ ID numbers (i.e. vertex numbers) as both the row and column headings, as below:
      1     2     3     4     ...   400
1     0     0.8   0.7   0.1
2     0.8   0     0.6   0.3
3     0.7   0.6   0     0.9
4     0.1   0.3   0.9   0
...
400
This correlation matrix was saved in a "Correlations.csv" file, which I imported using read.csv. I then used the code below to convert it to a distance matrix, set entries whose correlation was below a threshold (0.33) to zero, turn it into an adjacency matrix for igraph, and run cluster_louvain (this code is also provided in the answer here):
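library(igraph)  # graph.adjacency, cluster_louvain
library(psych)   # cor2dist (assuming the psych package's version)

correlationmatrix <- read.csv("Correlations.csv", header = TRUE,
                              row.names = 1, check.names = FALSE)
distancematrix <- cor2dist(correlationmatrix)
DM2 <- as.matrix(distancematrix)
DM2[correlationmatrix < 0.33] <- 0  # zero out entries whose correlation is below the threshold
G2 <- graph.adjacency(DM2, mode = "undirected", weighted = TRUE, diag = TRUE)
clusterlouvain <- cluster_louvain(G2)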
sizes(clusterlouvain)
Community sizes
1 2
200 200
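To make the conversion and thresholding step concrete, here is a minimal base-R sketch on the 4x4 excerpt shown earlier. It assumes that cor2dist computes d = sqrt(2 * (1 - r)), which is how the psych documentation describes it; the names r and d are only illustrative and are not part of my actual script.

```r
# 4x4 excerpt of the correlation matrix (diagonal is 0, as in the file)
r <- matrix(c(0,   0.8, 0.7, 0.1,
              0.8, 0,   0.6, 0.3,
              0.7, 0.6, 0,   0.9,
              0.1, 0.3, 0.9, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(1:4, 1:4))

# Assumed formula behind cor2dist: higher correlation -> smaller distance
d <- sqrt(2 * (1 - r))

# Same thresholding as in the script: entries with correlation < 0.33 become 0
d[r < 0.33] <- 0

print(round(d, 3))
```

In this toy case the pair with the highest correlation (3 and 4, r = 0.9) ends up with the smallest non-zero weight, and the diagonal is zeroed because the diagonal correlations are 0.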
I then wanted to get the cluster number beside each ID number, to know which individual belonged to each community. In the list of vertex IDs, the membership beside them was listed as ‘1 2 1 2 1 2 1 2’, which obviously was not right (we would not expect every alternate individual in the dataset to be assigned to a different community):
IDs_cluster <- cbind(V(G2)$name, clusterlouvain$membership)
IDs_cluster
ID Membership
1 1
2 2
3 1
4 2
5 1
6 2
…
400 2
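As a side note on building this ID/membership table, the sketch below uses stand-in vectors in place of V(G2)$name and clusterlouvain$membership (the six values are copied from the listing above). It only illustrates that cbind() on a character and a numeric vector returns a character matrix, whereas a data frame keeps the membership numeric and makes it easy to tabulate; the names ids, as_matrix and ids_cluster are illustrative.

```r
# Stand-ins for V(G2)$name and clusterlouvain$membership
ids        <- c("1", "2", "3", "4", "5", "6")
membership <- c(1, 2, 1, 2, 1, 2)

# cbind() coerces everything to character when one input is character
as_matrix <- cbind(ids, membership)

# A data frame keeps each column's own type and gives named columns
ids_cluster <- data.frame(ID = ids, Membership = membership,
                          stringsAsFactors = FALSE)

# Count how many individuals fall into each community
print(table(ids_cluster$Membership))
```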
From looking at other datasets I realised the problem might have been because the row headings in my correlation matrix were numerical. So I changed the correlation matrix so that the row headings were still the ID numbers, but the column headings were ‘V1-V400’:
      V1    V2    V3    V4    ...   V400
1     0     0.8   0.7   0.1
2     0.8   0     0.6   0.3
3     0.7   0.6   0     0.9
4     0.1   0.3   0.9   0
...
400
I imported this as a .csv file and re-ran ‘cluster_louvain’, as below:
correlationmatrix_V <- read.csv("Correlations_withV.csv", header = TRUE,
                                row.names = 1, check.names = FALSE)
distancematrix_V <- cor2dist(correlationmatrix_V)
DM2_V <- as.matrix(distancematrix_V)
DM2_V[correlationmatrix_V < 0.33] <- 0
G2_V <- graph.adjacency(DM2_V, mode = "undirected", weighted = TRUE, diag = TRUE)
clusterlouvain_V <- cluster_louvain(G2_V)
Now when I re-ran cluster_louvain, it generated a more sensible result of three clusters, with group membership of each cluster looking more like what we would expect:
sizes(clusterlouvain_V)
Community sizes
1 2 3
168 52 180
IDs_cluster <- cbind(V(G2_V)$name, clusterlouvain_V$membership)
View(IDs_cluster)
ID Membership
1 1
2 1
3 3
4 2
5 2
6 2
…
400 1
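One way to compare the two imports directly is to inspect the objects that read.csv returns. The sketch below is only a diagnostic on toy 4x4 stand-ins for the two files, not an explanation of the result: check_import is a hypothetical helper, and the "ID" label in the first header cell is an assumption about how the .csv files are laid out.

```r
# Hypothetical helper: summarise a correlation table as returned by read.csv
check_import <- function(cm) {
  c(square      = nrow(cm) == ncol(cm),                      # same number of rows and columns?
    all_numeric = all(sapply(cm, is.numeric)),               # no column read in as text?
    names_match = identical(rownames(cm), colnames(cm)),     # same labels in the same order?
    symmetric   = isSymmetric(unname(as.matrix(cm))))        # values mirror across the diagonal?
}

# Toy stand-ins for the two files; "ID" as the first header cell is an assumption
numeric_headers <- "ID,1,2,3,4
1,0,0.8,0.7,0.1
2,0.8,0,0.6,0.3
3,0.7,0.6,0,0.9
4,0.1,0.3,0.9,0"
v_headers <- sub("ID,1,2,3,4", "ID,V1,V2,V3,V4", numeric_headers, fixed = TRUE)

cm_num <- read.csv(text = numeric_headers, header = TRUE,
                   row.names = 1, check.names = FALSE)
cm_v   <- read.csv(text = v_headers, header = TRUE,
                   row.names = 1, check.names = FALSE)

# One row of checks per import
print(rbind(numeric = check_import(cm_num), V = check_import(cm_v)))
```

Running the same checks on the real 400x400 objects (correlationmatrix and correlationmatrix_V) would show whether the two imports differ in anything other than their column labels.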
My question is: could anyone clarify what happened when the row and column headings were the same, such that group membership was assigned to alternate individuals (i.e. '1 2 1 2' down the ID list, as in the first example), and why this was resolved by changing the column headings to a non-numerical format (as in the second example)?
This may be a simple mistake on my part: when importing the .csv of the correlation matrix with ‘read.csv’, I may not have used the correct settings, given that my column headings were also numerical.
However, I would like to understand why this meant ‘cluster_louvain’ assigned group membership in the way it did. I am posting this in case it is useful to anyone who makes the same mistake I did. Any insights would be welcome, and thank you for any advice!