
My question deals with the application of graph clustering algorithms. Most of the time, I see graphs built from nodes and edges that are already present in the data. For example, suppose we have social media data: each individual could be represented as a node and the relationships between individuals as edges. Using this information, we could build a graph and then run graph clustering algorithms (e.g. Louvain clustering) on it.
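
For concreteness, here is a minimal sketch of that first case in R with igraph; the edge list (the names and friendships) is made up purely for illustration:

    library(igraph)

    # Hypothetical friendship edge list: each row is one relationship
    edges <- data.frame(
      from = c("alice", "alice", "bob",   "carol", "dave", "erin"),
      to   = c("bob",   "carol", "carol", "dave",  "erin", "frank")
    )

    # Build an undirected graph directly from the relational data
    g <- graph_from_data_frame(edges, directed = FALSE)

    # Louvain community detection on that graph
    comm <- cluster_louvain(g)
    membership(comm)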

Sometimes, graphs can also be built from distances between points, with the pairwise distances (or similarities derived from them) serving as edge weights. For example, in spectral clustering a KNN (k-nearest-neighbor) graph is built from the data, the eigenvectors of the resulting graph Laplacian are computed, and K-Means is then run on that spectral embedding rather than on the graph itself.
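
To make that pipeline concrete, here is a rough sketch of the spectral clustering steps on the four Iris features; the choices below (k = 10 neighbors, the symmetrization rule, three eigenvectors, 3 centers for K-Means) are illustrative assumptions rather than tuned settings:

    x_spec <- as.matrix(iris[, 1:4])
    n <- nrow(x_spec)
    d_spec <- as.matrix(dist(x_spec))       # pairwise Euclidean distances

    # Symmetric kNN adjacency matrix: connect i and j if either point is
    # among the other's k nearest neighbors (k = 10 is an arbitrary choice)
    k <- 10
    A <- matrix(0, n, n)
    for (i in 1:n) {
      nn <- order(d_spec[i, ])[2:(k + 1)]   # skip the point itself
      A[i, nn] <- 1
    }
    A <- pmax(A, t(A))

    # Unnormalized graph Laplacian L = D - A
    L <- diag(rowSums(A)) - A

    # Embed each flower with the eigenvectors of the 3 smallest eigenvalues,
    # then run K-Means on that embedding (not on the graph itself)
    ev <- eigen(L, symmetric = TRUE)
    embedding <- ev$vectors[, (n - 2):n]
    km <- kmeans(embedding, centers = 3, nstart = 25)
    table(km$cluster, iris$Species)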

My question is this: suppose we take the famous Iris data and remove the response variable ("Species"). Would it make sense to create a graph of this Iris data in which each node corresponds to an individual flower and each edge corresponds to the pairwise Euclidean distance between a pair of points? Assuming this is a logical and correct approach, could graph clustering algorithms then be performed on this Iris graph?

Below, I first create a graph of the Iris data using pairwise Euclidean distances (in R) and perform Louvain clustering and Infomap clustering on the resulting graph. I then create a KNN graph of the Iris data, apply MST-kNN (minimum spanning tree) clustering, and also run Louvain and Infomap clustering on the KNN graph.

Could someone please provide an opinion on what I have done? Is this intuitive, and does it make mathematical sense? As a way of "cheating", the Iris data only has 3 species, so if a given clustering algorithm returns significantly more than 3 clusters, we know that the graph and/or the clustering algorithm may not be the best choice (a sketch of this comparison against the species labels follows the code below). In real applications, however, we cannot know how many "true" classes exist in the data.

    library(igraph)
    library(network)
    library(reshape2)
    library(mstknnclust)
    library(visNetwork)
    library(cluster)

    # Louvain clustering on a distance-based graph - maybe this is correct
    x <- iris[, 1:4]

    # Pairwise Euclidean distances between the 150 flowers
    dist <- daisy(x, metric = "euclidean")
    d_mat <- as.matrix(dist)

    # Reshape the distance matrix into a long edge list (from, to, distance)
    d_long <- melt(d_mat)
    colnames(d_long) <- c("from", "to", "distance")

    # Keep only pairs whose distance exceeds 0.5 as edges
    # (each unordered pair appears twice, so duplicate edges remain)
    d_mat_long <- d_long[which(d_long$distance > .5), ]
    graph <- graph_from_data_frame(d_mat_long, directed = FALSE)
    
    
    # Node and edge tables for visNetwork
    nodes <- as_data_frame(graph, what = "vertices")
    colnames(nodes) <- "id"
    nodes$label <- nodes$id
    links <- as_data_frame(graph, what = "edges")

    visNetwork(nodes, links) %>% visIgraphLayout(layout = "layout_with_fr")

    # Louvain community detection on the distance-threshold graph
    cluster <- cluster_louvain(graph)
    nodes$cluster <- cluster$membership

    # Colour cluster 1 red and all other clusters blue
    nodes$color <- ifelse(nodes$cluster == 1, "red", "blue")

    visNetwork(nodes, links) %>%
      visIgraphLayout(layout = "layout_with_fr") %>%
      visOptions(selectedBy = "cluster") %>%
      visNodes(color = "color")
    
    
    
    
    # Infomap and Louvain clustering done on the distance-based graph but with a different algorithm: I think this is wrong
    
    imc <- cluster_infomap(graph)
    membership(imc)
    communities(imc)
    plot(imc, graph)
    
    lc <- cluster_louvain(graph, weights = NULL)
    membership(lc)
    communities(lc)
    plot(lc, graph)
    
    # MST (minimum spanning tree) clustering on the kNN graph: based on the number of clusters I think this is wrong
    
    # Complete graph on all observations, weighted by the distance matrix
    cg <- generate.complete.graph(1:nrow(x), d_mat)

    # Generate the kNN graph from the complete graph
    knn <- generate.knn(cg)

    plot(knn$knn.graph,
         main = paste("kNN \n k=", knn$k, sep = ""))

    # MST-kNN clustering on the distance matrix
    results <- mst.knn(d_mat)

    igraph::V(results$network)$label.cex <- 0.6

    plot(results$network, vertex.size = 8,
         vertex.color = igraph::clusters(results$network)$membership,
         layout = igraph::layout_with_fr(results$network, niter = 10000),
         main = paste("MST-kNN \n Clustering solution \n Number of clusters=", results$cnumber, sep = ""))
    
    # Louvain and Infomap clustering on the kNN graph - maybe this is correct
    
    #louvain
    lc <- cluster_louvain(knn$knn.graph, weights = NULL)
    membership(lc)
    communities(lc)
    plot(lc, knn$knn.graph)
    
    imc <- cluster_infomap(knn$knn.graph)
    membership(imc)
    communities(imc)
    plot(imc, knn$knn.graph)
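
Since the "cheating" check relies on the known species labels, one way to make the comparison explicit is a contingency table of cluster membership against Species (and, optionally, the adjusted Rand index from the mclust package). This sketch assumes the vertex order of the kNN graph matches the row order of iris, as constructed above:

    # Compare the kNN-graph communities with the known species labels
    table(membership(lc), iris$Species)
    table(membership(imc), iris$Species)

    # Optional single-number agreement score (extra dependency)
    # install.packages("mclust")
    library(mclust)
    adjustedRandIndex(membership(lc), iris$Species)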

1 Answer


"louvain clustering done on a distance based graph - maybe this is correct"

Not really; distance weights make sense when you are computing things like betweenness centrality. If your interest is similarity, then convert the distances to similarities before running community detection.
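
For illustration, one common (though not the only) conversion is a Gaussian kernel, which maps small distances to large edge weights; the bandwidth sigma below is an arbitrary choice that would need tuning in practice:

    library(igraph)
    library(cluster)

    x <- iris[, 1:4]
    d_mat <- as.matrix(daisy(x, metric = "euclidean"))

    # Gaussian (RBF) similarity: nearby points get weights close to 1
    sigma <- median(d_mat[d_mat > 0])       # illustrative bandwidth choice
    sim <- exp(-d_mat^2 / (2 * sigma^2))
    diag(sim) <- 0                          # drop self-loops

    # Weighted, undirected similarity graph; Louvain uses the edge weights
    g_sim <- graph_from_adjacency_matrix(sim, mode = "undirected",
                                         weighted = TRUE)
    lc_sim <- cluster_louvain(g_sim)
    table(membership(lc_sim), iris$Species)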
