
I have a graph network, the Email-Eu network, which is available here.

The dataset contains the graph itself, around 1,005 nodes with edges that form one giant graph, together with ground-truth labels mapping each node to its community (department). Each node belongs to exactly one of 42 departments.
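For reference, a minimal loading sketch; the file names (`email-Eu-core.txt` and `email-Eu-core-department-labels.txt`) are assumed from the SNAP distribution, so adjust the paths to your copy:

```python
from collections import Counter
import networkx as nx

# File names assumed from the SNAP distribution; adjust to your copy.
G = nx.read_edgelist("email-Eu-core.txt", nodetype=int)

# Ground-truth department label per node, one "node dept" pair per line.
truth = {}
with open("email-Eu-core-department-labels.txt") as f:
    for line in f:
        node, dept = map(int, line.split())
        truth[node] = dept

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print(len(set(truth.values())), "departments in the ground truth")
print("largest department:", Counter(truth.values()).most_common(1))
```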

I want to run a community detection algorithm on the graph to find the corresponding department for each node. My main objective is to find the nodes in the largest community.

So, first I need to find the 42 departments (communities), then find the nodes in the biggest one of them.

I started with the Girvan-Newman algorithm to find the communities. The beauty of Girvan-Newman is that it is easy to implement: repeatedly find the edge with the highest betweenness centrality and remove it, until the graph breaks into the 42 departments (communities) I want.
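As an illustration, a sketch of that loop using NetworkX's built-in `girvan_newman` generator (the file name is the same assumption as above; note the algorithm recomputes betweenness after every removal, so it is slow on a graph of this size):

```python
import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.read_edgelist("email-Eu-core.txt", nodetype=int)  # assumed file name
G.remove_edges_from(nx.selfloop_edges(G))  # girvan_newman rejects self-loops

k = 42  # target number of communities (departments)
partitions = girvan_newman(G)

# girvan_newman yields ever-finer partitions; keep the last one
# with at most k communities.
partition = None
for communities in itertools.takewhile(lambda c: len(c) <= k, partitions):
    partition = communities

largest = max(partition, key=len)
print(len(partition), "communities; the largest has", len(largest), "nodes")
```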

I am struggling to find other community detection algorithms that give me the option of specifying how many communities/partitions to break my graph into.

Is there any community detection function/technique that lets me specify how many communities I need to uncover from my graph? Any ideas are very much appreciated.

I am using Python and NetworkX.

Jan
  • Some clustering methods proposed in the [scikit-learn](https://scikit-learn.org/stable/modules/clustering.html#clustering) package allow you to cluster your graph with a specified number of clusters. There might be one that fits community detection with a few small tweaks to the graph. – m.raynal Apr 18 '19 at 13:42

3 Answers


A (very) partial answer (and solution) to your question is to use the Fluid Communities algorithm, implemented in NetworkX as `asyn_fluidc`.

Note that it works on connected, undirected, unweighted graphs, so if your graph has n connected components, you would have to run it n times. In fact, this could be a significant issue, since you would need some preliminary knowledge of each component to choose the corresponding k.

Anyway, it is worth a try.
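A minimal sketch of how that could look, assuming the SNAP file name and restricting to the largest connected component as noted above:

```python
import networkx as nx
from networkx.algorithms.community import asyn_fluidc

G = nx.read_edgelist("email-Eu-core.txt", nodetype=int)  # assumed file name

# asyn_fluidc requires a connected, undirected graph, so run it on
# the largest connected component (repeat per component if needed).
giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()

communities = list(asyn_fluidc(giant, k=42, seed=0))
largest = max(communities, key=len)
print(len(communities), "communities; the largest has", len(largest), "nodes")
```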

sentence

You may want to try pysbm. It is based on networkx and implements different variants of stochastic block models and inference methods.

If you are willing to switch from networkx to a different Python-based graph package, you may want to consider graph-tool, where you would be able to use the stochastic block model for the clustering task. Another noteworthy package is igraph; you may want to look at How to cluster a graph using python igraph.

The approaches directly available in networkx are rather old-fashioned. If you aim for state-of-the-art clustering methods, you may consider spectral clustering or Infomap. The selection depends on how you intend to use the inferred communities. The task of inferring ground truth from a network falls (approximately) under the No-Free-Lunch theorem, i.e. (roughly) no algorithm exists that returns "better" communities than any other algorithm when results are averaged over all possible inputs.
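Picking up m.raynal's comment above, spectral clustering in scikit-learn does let you fix the number of clusters. A sketch using the adjacency matrix as a precomputed affinity (file name assumed from SNAP):

```python
import networkx as nx
import numpy as np
from sklearn.cluster import SpectralClustering

G = nx.read_edgelist("email-Eu-core.txt", nodetype=int)  # assumed file name
nodes = sorted(G.nodes())
adj = nx.to_numpy_array(G, nodelist=nodes)

# affinity="precomputed" makes scikit-learn treat the adjacency
# matrix itself as the similarity matrix.
sc = SpectralClustering(n_clusters=42, affinity="precomputed",
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(adj)

# Nodes in the largest community.
largest_label = np.bincount(labels).argmax()
largest = [n for n, lab in zip(nodes, labels) if lab == largest_label]
print(len(largest), "nodes in the largest community")
```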

Sparky05
  • Thanks @sparky05, I spent a full day trying to install igraph and was not successful. I really find it difficult to get it working. – Jan Apr 20 '19 at 03:33
  • 1
    That's the problem of the more sophisticated (but faster) graph packages. The installation of igraph a and graph-tool is not as simple as installing networkx. If you have the time to wait after easter holidays, I know of a coming networkx based clustering package, implementing some variants of the stochastic block model. – Sparky05 Apr 20 '19 at 10:37
  • 1
    @Jan I added the link to the repository in case you are still looking for clustering without switching from `networkx`. – Sparky05 Apr 25 '19 at 09:53
  • Thanks dear, I will try it out – Jan Apr 25 '19 at 14:50

I am not entirely sure of my answer, but maybe you can try this. Are you aware of label propagation? The main idea is that some nodes in the graph are labelled, i.e. they belong to a community, and you want to assign labels to the other, unlabelled nodes in your graph. LPA will spread these labels across the graph and give you a list of nodes and the communities they belong to. These communities will be the same ones your labelled set of nodes belongs to.

So I think you can control the number of communities you extract from the graph by controlling how many communities you initialise at the beginning. It is also possible, though, that after LPA converges some of the communities you initialised vanish from the graph, due to the graph structure and the randomness of the algorithm. But there are many variants of LPA where you can control this randomness. I believe this page of sklearn talks about it.

You can read about LPA here and also here.
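To make the seeding idea concrete, here is a hand-rolled sketch (not the sklearn implementation mentioned above) that fixes k seed labels and propagates them; the file name is again an assumption:

```python
import random
from collections import Counter
import networkx as nx

G = nx.read_edgelist("email-Eu-core.txt", nodetype=int)  # assumed file name

k = 42
random.seed(0)
seeds = set(random.sample(list(G.nodes()), k))
labels = {node: i for i, node in enumerate(seeds)}  # k seed communities

# Repeatedly give each non-seed node the most common label among its
# labelled neighbours; cap the iterations in case labels oscillate.
for _ in range(100):
    changed = False
    for node in G.nodes():
        if node in seeds:
            continue  # seed labels stay fixed
        neigh = [labels[n] for n in G[node] if n in labels]
        if not neigh:
            continue
        best = Counter(neigh).most_common(1)[0][0]
        if labels.get(node) != best:
            labels[node] = best
            changed = True
    if not changed:
        break

# Some seeded communities may vanish, as the answer warns.
print(Counter(labels.values()).most_common(3))  # three biggest communities
```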

Siddhant Tandon
  • Hi Siddhant, actually the analysis I want should not start with any prior knowledge from the ground-truth labels. So it should find the communities based on the available graph only. – Jan Apr 18 '19 at 13:17
  • How about Louvain clustering? It is implemented in the python-louvain package's best_partition [function](https://python-louvain.readthedocs.io/en/latest/api.html). The resolution parameter gives more or fewer communities, but you can't specify an exact number; see the sketch after these comments. There was a discussion about this parameter [here](https://github.com/biolab/orange3/issues/3184). – Siddhant Tandon Apr 18 '19 at 13:34
  • I tried it. However, it does not allow me to choose the number of communities to partition into; it partitions based on the best possible fit. – Jan Apr 18 '19 at 13:36
  • Umm, another try: if you can come up with a metric that captures the distance between two nodes (e.g. edge weight or node degree), you could run k-means on it and set the number of clusters you want? – Siddhant Tandon Apr 18 '19 at 13:55
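For completeness, a sketch of the Louvain suggestion from the comments, using the python-louvain package; as noted there, `resolution` shifts the granularity but cannot pin an exact community count (file name assumed from SNAP):

```python
import networkx as nx
import community as community_louvain  # the python-louvain package

G = nx.read_edgelist("email-Eu-core.txt", nodetype=int)  # assumed file name

# Higher resolution tends to yield more, smaller communities,
# but the exact count cannot be fixed in advance.
for resolution in (0.5, 1.0, 2.0):
    partition = community_louvain.best_partition(G, resolution=resolution,
                                                 random_state=0)
    print(resolution, "->", len(set(partition.values())), "communities")
```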