I have a database of 20 million users and the connections between those people. How can I test the "six degrees of separation" concept on this data in the most efficient way, programmatically?
4 Answers
You just want to measure the diameter of the graph. That is exactly the metric that captures the separation between the most distantly connected nodes in a graph.
There are lots of algorithms for this on Google; the Boost Graph Library has some too.

- Is six degrees a maximum or an average? Most of the actual analysis I've read uses the average, not the maximum. – Kathy Van Stone Jun 12 '09 at 21:06
- The common conception of "six degrees of separation" is that it is a maximum. Of course, that's not true at all in reality. It's just more impressive to say it that way, and hard to find counterexamples. – Tyler McHenry Jun 12 '09 at 21:13
You can probably fit the graph in memory (as an adjacency list, where each vertex stores a list of its neighbors).
Then, from each vertex n, run a breadth-first search (using a queue) to a depth of 6 and count the number of vertices visited. If not all vertices are visited, you have disproved the theorem. Otherwise, continue with the next vertex n.
This is O(N*(N + #edges)) = N*(N + 100N) = 100N^2 if a user has 100 connections on average, which is not ideal for N = 20 million. I wonder if the mentioned libraries can compute the diameter with a better time complexity (the general algorithm is O(N^3)).
The computations for individual vertices are independent, so they could be done in parallel.
A little heuristic: start with vertices that have the lowest degree (better chance to disprove the theorem).
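For illustration, here is a minimal Python sketch of the depth-limited BFS described above. The adjacency-list format (a dict mapping each vertex to a list of its neighbors) and the function name are my own choices, not from any particular library:

```python
from collections import deque

def within_six_degrees(adj, start):
    """Run BFS from `start` to a depth of 6 and report whether every
    vertex in the graph was reached.

    `adj` maps each vertex id to a list of its neighbors' ids.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        v, depth = frontier.popleft()
        if depth == 6:
            continue  # stop expanding beyond six hops
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, depth + 1))
    # If some vertex was never seen, it is more than 6 hops away
    # (or disconnected) and the conjecture fails for `start`.
    return len(seen) == len(adj)

# Tiny example: a path 0-1-2-3, where everything is within 3 hops of 0
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(within_six_degrees(adj, 0))  # True
```

Running this from every vertex gives the O(N*(N + #edges)) total cost discussed above; as noted, the per-vertex searches are independent and parallelize trivially.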

- I think this is considerably worse than O(n^2). Even assuming each node is connected to only 3 other nodes, a search tree of depth 6 would have 3*2^0 + 3*2^1 + 3*2^2 + 3*2^3 + 3*2^4 + 3*2^5 nodes. Exponential growth. – patros Jun 12 '09 at 21:38
- For each starting vertex you visit each vertex at most once, so the run for one vertex takes O(N). – Martin Konicek Jun 13 '09 at 03:01
- Ah, true, that is a limit on it. I think this is still O(N^3) then, isn't it? Finding a path from vertex A to vertex B is O(N), and you must do this O(N^2) times. – patros Jun 13 '09 at 23:20
- Hi, imagine you want to compute the shortest distance from one vertex to every other vertex in the graph. Since the edges don't have weights (length of a path = # of edges), this can be done in O(N + #edges): run a breadth-first search - http://en.wikipedia.org/wiki/Breadth-first_search. You are right, it's not O(N) but O(N + #edges), where #edges is about 100*N (if a user has 100 connections on average). Thanks – Martin Konicek Jun 15 '09 at 10:06
I think the most efficient way (worst case) is almost N^3. Build an adjacency matrix, then compute its powers: matrix^2, matrix^3, matrix^4, matrix^5 and matrix^6. Look for any entries that are 0 in every power from the matrix itself through matrix^6 - those pairs of people are more than six degrees apart.
Heuristically you can try to single out subgraphs (large clumps of people who are only connected to other clumps by a relatively small number of "bridge" nodes), but there's absolutely no guarantee you'll have any.
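A toy sketch of the matrix-power idea in plain Python, using boolean matrices (A^k has a True entry at (i, j) exactly when a walk of length k connects i and j); the function names are mine, and real use at this scale would need a sparse or blocked representation:

```python
def bool_matmul(a, b):
    """Boolean matrix product: result[i][j] is True iff there is some k
    with a[i][k] and b[k][j] (i.e. the walks concatenate)."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def reachable_within(adj_matrix, max_hops):
    """OR together A, A^2, ..., A^max_hops. A False entry (i, j) in the
    result means i and j are more than max_hops apart or disconnected."""
    n = len(adj_matrix)
    reach = [row[:] for row in adj_matrix]   # walks of length 1
    power = [row[:] for row in adj_matrix]
    for _ in range(max_hops - 1):
        power = bool_matmul(power, adj_matrix)  # walks of length 2, 3, ...
        reach = [[reach[i][j] or power[i][j] for j in range(n)]
                 for i in range(n)]
    return reach

# Tiny example: path graph 0-1-2-3
A = [[False] * 4 for _ in range(4)]
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u][v] = A[v][u] = True
reach = reachable_within(A, 6)
print(reach[0][3])  # True: 0 and 3 are only 3 hops apart
```

The naive products here are O(N^3) each, which is where the near-N^3 bound in the answer comes from; fast matrix multiplication (see the comments below) shaves the exponent a little.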

- You can't build a 20 million x 20 million adjacency matrix in memory. Also, the multiplication would be O(N^3), where N is 20 million. – Martin Konicek Jun 12 '09 at 21:32
- It's roughly n^2.8 with Strassen's algorithm, since they're square matrices. You also don't need to keep the whole matrix in memory, only the parts you're actively multiplying. Page the rest out to disk. – patros Jun 12 '09 at 21:43
- It does require a lot of disk though... 400 TB for a naive approach. Lots of room for compression, though. – patros Jun 12 '09 at 21:50
Well, a better answer has already been given, but off the top of my head I would have gone with the Floyd-Warshall all-pairs shortest paths algorithm, which is O(n^3). I'm unsure of the complexity of the graph-diameter algorithm, but it "sounds" like it would also be O(n^3). I'd appreciate clarification on this if anyone knows.
On a side note, do you really have such a database? Scary.
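For completeness, here is a textbook Floyd-Warshall sketch in Python (my own variable names; O(n^3) time and O(n^2) memory, so it is only practical for small graphs, certainly not 20 million vertices):

```python
def floyd_warshall(n, edges):
    """All-pairs shortest paths for an unweighted, undirected graph
    with vertices 0..n-1. Returns the full n x n distance matrix."""
    INF = float("inf")
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1  # every edge has length 1
    # Relax: allow paths through intermediate vertex k
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# The diameter is the largest finite entry in the distance matrix.
dist = floyd_warshall(4, [(0, 1), (1, 2), (2, 3)])
diameter = max(d for row in dist for d in row if d != float("inf"))
print(diameter)  # 3
```

Once the distance matrix is built, "six degrees" holds iff the diameter is at most 6; on a graph this size, though, the repeated-BFS approach above is the cheaper route.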
