2

I want to compute the clustering coefficient in a graph with more than 1.5M vertices.

I have a Dictionary with vertex ID as key and the value is a List with all the vertices which are connected to the vertex ID.

clustering coefficient = 3 * triangles / number of connected triplets.

The problem is: it takes me more than 4 hours to compute the number of triangles in the graph.

My code:

List<string> list_t = new List<string>();
Dictionary<string, List<string>> copy = new Dictionary<string, List<string>>(Global.dict_edge_undirected);
int triangles = 0;                          // (number of triangles in graph)*3
foreach (KeyValuePair<string, List<string>> pair in copy)
{
    if (pair.Value.Count > 1)
    {
        foreach (string neigh1 in pair.Value)
        {
            list_t = copy[neigh1];
            foreach (string neigh2 in pair.Value)
            {
                if (neigh1 != neigh2 && list_t.Contains(neigh2))
                {
                    triangles++;

                }
            }
        }
    }
}

How can I reduce the run time?

C++ igraph library calculates the clustering coefficient of this graph in less than 3 minutes.

What am I missing?

Chris Mantle
  • 6,595
  • 3
  • 34
  • 48
  • Use a profiler to find the slowest code: http://stackoverflow.com/questions/3927/what-are-some-good-net-profilers – Stormenet Dec 02 '14 at 13:31
  • 2
    I'm having a bit of trouble trying to figure out what your code is doing here. What is the purpose of the second nested foreach that is iterating over the same values? – Jesse Carter Dec 02 '14 at 13:33
  • Single or multi threaded? I would go multi threaded here. oN a 72 core machine you can cut that time down quite a lot. Or dump C# and go onto the GPU for this. – TomTom Dec 02 '14 at 13:34
  • 1
    Use a graph library like Quickgraph. Graph libraries use specialized algorithms to achieve good performance. A brute-force loop will never achieve the same performance as a good algorithm. Multithreading isn't an answer either. A good single-threaded algorithm can beat a bad multithreaded loop. – Panagiotis Kanavos Dec 02 '14 at 13:37

1 Answers1

5

You can use HashSet<T> instead of list at least:

        var copy = new Dictionary<string, HashSet<string>>(Global.dict_edge_undirected);
        int triangles = 0;                          // (number of triangles in graph)*3
        foreach (var pair in copy)
        {
            if (pair.Value.Count > 1)
            {
                foreach (string neigh1 in pair.Value)
                {
                    triangles += pair.Value.Count(neigh2 => neigh1 != neigh2 && copy[neigh1].Contains(neigh2));
                }
            }
        }

it removes one inner cycle because Contains method is O(1) for hash table.

second - you can use int ID instead of strings (and have a Dictionary<int,string> or simply string[] to get your string by ID). It will remove extra comparasion for strings with same hashcode. And in this case you don't need to use dictionary, and you can just use an array:

        const int N = 100;
        var copy = new HashSet<int>[N];
        int triangles = 0;                          // (number of triangles in graph)*3
        foreach (var pair in copy)
        {
            if (pair.Count > 1)
            {
                foreach (int neigh1 in pair)
                {
                    triangles += pair.Count(neigh2 => neigh1 != neigh2 && copy[neigh1].Contains(neigh2));
                }
            }
        }

all operations with array are optimized by JIT.

finally, you can use Parallel to boost performance (local variable is used to minimize Interlocked.Add calls)

        int totalTriangles = 0;
        Parallel.ForEach(copy, () => 0,
                         (set, _, __) =>
                         {
                             int triangles = 0;
                             if (set.Count > 1)
                             {
                                 foreach (int neigh1 in set)
                                 {
                                     triangles += set.Count(neigh2 => neigh1 != neigh2 &&
                                                                      copy[neigh1].Contains(neigh2));
                                 }
                             }
                             return triangles;
                         },
                         i => Interlocked.Add(ref totalTriangles, i));

or using PLINQ:

const int N = 100;
var copy = new HashSet<int>[N];
int triangles = copy.AsParallel().Where(pair => pair.Count > 1).Sum(pair => pair.Sum(neigh1 => pair.Count(neigh2 => neigh1 != neigh2 && copy[neigh1].Contains(neigh2))));
Alex Zhukovskiy
  • 9,565
  • 11
  • 75
  • 151