10

Given a random simple graph of N nodes, N edges, and a uniform edge probability p, what is the expected size of the largest connected component?

  • The only initial input parameter supplied is N, which represents the number of nodes in the graph
  • The graph is simple, meaning that it is not allowed to contain any self-loops
  • There are exactly N node pairings, ie. the edgelist is of the form {(1,a), (2,b), (3,c), ..., (N,[next available letter of the alphabet])}. The conditions are that a!=1, b!=2, c!=3 etc. but other than that, a,b,c,... can take any value from 1 to N inclusive
  • The node n forming an edge with some other node can be described by a uniform probability distribution
  • Foreach such valid graph constructed of N nodes and N edges, find the size of the largest connected component S
  • By connected component, this is defined as the largest cluster/group of nodes where each node in the connected component can reach to/is reachable from every other node in the same connected component

The question is: among all such valid graphs constructed, what is the expected value of S?

I would like to know of an efficient way to think about/analyze/solve this problem that can deal with N ranging from 2 to 50 inclusive. Some code to illustrate would be helpful to my understanding.


I wrote the following source codes to naively bruteforce the possibilities, hoping to find a pattern from there. Managed to get up to N = 9 with these:

Both attempt to generate all possible permutations according to the criteria specified in the problem, and then calculate S for each graph generated via BFS.

Output so far

As for the format, eg. for the line "N=4: {2:3,4:78} 106/27", this means that there are 3 graphs with the size of the largest component = 2, and 78 graphs with the size of the largest component = 4, so the final answer = (3/(78+3))*2 + (78/(78+3))*4 = 106/27.

N=2: {2:1} 2/1
N=3: {3:8} 3/1
N=4: {2:3,4:78} 106/27
N=5: {3:80,5:944} 155/32
N=6: {2:15,3:640,4:1170,6:13800} 17886/3125
N=7: {3:840,4:21840,5:19824,7:237432} 38563/5832
N=8: {2:105,3:17920,4:229320,5:422912,6:386400,8:4708144} 6152766/823543
N=9: {3:153440,4:786240,5:9634464,6:9273600,7:8547552,9:105822432} 17494593/2097152

edit: Just discovered this OEIS sequence #A000435 gives the answers (formula/listing up to N=18) for the number of graphs with N nodes and largest size component = N. This is exactly coincidental with my bruteforce output, eg. for N=9, there are 105822432 graphs with the largest size of connected component = 9. Not sure if this helps - one of the formulas given there to derive this for larger N is a(n) = (n-1)! * Sum (n^k/k!, k=0..n-2) Python implementation


Example for N = 4:

There are a total of 81 possible edge assignments for 4 nodes (1-based indexed from 1 to N).

The format for the sample output below is: ((1, 2), (2, 1), (3, 1), (4, 1)): 4 means that there are edges between nodes 1 and 2, nodes 2 and 1, nodes 3 and 1, and nodes 4 and 1. Assume the edges are undirected/bidirectional. For such a graph, there is only a single connected component of all 4 nodes {1,2,3,4}, so the size of the largest (only) connected component is 4.

After generating this list, expected value of S can be computed via (fraction of all 81 instances whereS== 4) * 4 + (fraction of all 81 instances whereS== 2) * 2 =106/27 - since the only values S takes are 2,4.

{((1, 2), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 3)): 2,
 ((1, 2), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 1), (4, 2)): 2,
 ((1, 3), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 2), (4, 1)): 2,
 ((1, 4), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 3)): 4}
Adeline Ho
  • 119
  • 1
  • 6
  • You say *must be joined to exactly 1 other node*. I don't get it. That makes N/2 components, no? – aioobe Feb 15 '15 at 15:55
  • I meant there must be exactly 1 unique pairing for every node, that cannot be itself, ie. if there are 3 nodes, i must have an edgelist of the form {(1, a) , (2, b), (3, c)}, where a!=1, b!=2, c!=3, and a,b,c can take any other value from 1 to **N** inclusive. – Adeline Ho Feb 15 '15 at 16:01
  • also: edited the question to try to clarify this. thanks & sorry for the confusion - it is indeed possible, as you've tried to point out, that one node can be connected to multiple other nodes (or not); it's just a restriction on the edgelist – Adeline Ho Feb 15 '15 at 16:09
  • 2
    In what sense is this a bipartite graph? (Normally there would two different sets of vertices A/B and edges only between vertices in different sets if this is a bipartite graph) – Peter de Rivaz Feb 15 '15 at 16:11
  • Exactly. That makes the whole self-reference constraint redundant. Something is messed up with this problem formulation. – aioobe Feb 15 '15 at 16:12
  • oh, i think it must be just my unclear understanding of the terms. let me remove `bipartite` then. thought the example would help there – Adeline Ho Feb 15 '15 at 16:18
  • ok, I've made the corrections – Adeline Ho Feb 15 '15 at 16:21
  • @AdelineHo, you probably want to clarify that the graph is *directed* right in the beginning. – aioobe Feb 15 '15 at 16:24
  • ok, done. thanks for the guidance. does the formulation of the question make sense now? – Adeline Ho Feb 15 '15 at 16:31
  • First you say that the graph is directed, but then you say "Assume the edges are undirected/bidirectional". Which is it? – interjay Feb 15 '15 at 17:17
  • What is the meaning of connected: weakly connected, connected or strongly connected (if the question is from a Google interview then this must have been specified) – Rerito Feb 15 '15 at 18:57
  • The "directed" was added there by suggestion - see earlier comment; but then later I realised that that's not what I mean, so I've already removed it now. – Adeline Ho Feb 16 '15 at 01:40
  • Well, connected means: Two nodes A and B belong to the same cluster/group if they share an edge ie. A-B exists in the edgelist, or if A shares an edge with a node of the same cluster/group as B. – Adeline Ho Feb 16 '15 at 01:42
  • I reworked my answer and established a computable result. – Rerito Feb 16 '15 at 07:40
  • 1
    For the process you describe, among the N edges, an edge (i, j) can occur twice, right? i can choose j, and j can choose i? – Neal Young Apr 01 '18 at 18:39

3 Answers3

2

Before looking at the probabilities, let us look at a similar problem in combinatorics, i.e. given the rules of the graphs and N nodes, how many graphs can be attained when the largest connected component is S.

Notation

For the rest on this answer I will use the following notation:

C(S=s|N) notes the total count of possible graphs with N nodes and N edges as defined in the question with the largest connects component is S=s.

Examples:

  1. C(S=2|2) = 1
  2. C(S=2|3) = 0
  3. C(S=3|3) = 8

T(s) notes the number of graphs with s nodes and s edges as defined in the question where all nodes are connected. T(s) = C(S=s|s).

Examples

  1. T(2) = 1
  2. T(3) = 8

Key observations

Some key observation to lay out the stages of the solution:

  1. C(S=N-1|N) = 0 since the last node must be connected to itself. This is however not a legal graph and therefore the count for this case is 0
  2. C(S=2|N) is different than 0 only for even values of N. If N is odd, there has to be a node connected to itself and again, this is illegal.
  3. C(S=2|N) = (N choose 2) * C(S=2|N-2). This also correlates with 2. Since this recursion will eventually reach C(S=2|1) =0 and the entire expression will be 0 for odd N values.
  4. Expanding 3. to a slightly more general case we can get C(S=2|N) = (N choose 2) * T(2) * C(S=2|N-2) since T(2) = 1 this holds.
  5. Expanding 4. for a general case we get C(S=s|N) = (N choose s) * T(s) * [C(S=2|N-s) + C(S=3|N-s) + ... + C(S=s|N-s)]

Note that these expressions include double counts, we will get to that later.

This means that we found a recursive pattern to compute the number of graph (up to double counts), all we need is a method to compute T(s).

Computing T(s)

For a N nodes N edges graph as defined in the question, T(N) can be computed via three components:

  1. (N choose N-1) * T(N-1) * (N-1): The fully connected structure has a N-1 connected graph and the new node is connected to either of the N-1 nodes creating a N connected graph. This section contains double counts.
  2. Number of graphs creating a cycle the size of N which is (N-1)!
  3. correcting double counts

Therefore we can compute T(s) as follows (the explanation for the double counts in this case is too complex for me to write down an explanation): enter image description here

We note that for T(2) and T(3) we do not have double counts.

Examples

  1. T(3) = (3 choose 2)*T(2)*2 + 2! = 3*1*2+2=8
  2. T(4) = (4 choose 3)*T(3)*3 + 3! - (4 choose 2)*T(2)*2^2 = 4×8×3 +6 -6*1*4=78
  3. T(5) = (5 choose 4)*T(4)*4 + 4! - (5 choose 3)*T(3)*3^2 + (5 choose 2)*T(2)*2^3 = 5*78*4 +24 -10*8*9 +10*1*2^3 = 944

The only thing left is to find a way to overcome double counts in key observation 5. This is done using the link above. After breaking N to a different groups, we correct the double count by dividing the result by a factor to complete it.

Examples

  1. In case the have N=6 nodes and we compute
C(S=2|6) = (6 choose 2)*T(2)*C(S=2|4) = (6 choose 2)*T(2)*(4 choose 2)*T(2)

We can see that we divided 6 into three groups of 2 therefore we need to correct the double count by applying a factor of 3!:

C(S=2|6) = (6 choose 2)*T(2)*(4 choose 2)*T(2)/(3!) = 15
  1. In case the have N=9 nodes and we compute
C(S=3|9) = (9 choose 3)*T(3)*[C(S=2|6) + C(S=3|6)]

We can see that the first component of the brackets, i.e. C(S=2|6) was computed in example 1, and a factor of 3! was divided from the result because 6 was divided into three groups of 2. for the second component of the brackets we get:

C(S=3|9) = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*C(S=3|3)] = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*T(3)]

We see that 9 was divided into three groups of 3 and therefore we add a factor of 3! to counter the double counts:

C(S=3|9) = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*T(3)/(3!)] = 153440

Everything is now set up to compute C(S=s|N) for all s valuees between 2 and N. Then the probability can achieved by dividing by the sum of all graphs:

P(S=s|N) = C(S=s|N) / sum_s{C(S=s|N)}

And the expected value can be found via:

E[S|N] = sum_s{s * P(S=s|N)}

Implementation

Below I will show a python implementation for the method derived above. I will compute the values including the double counts to keep the recursive property, thus allowing me to compute all the graph counts in O(N^2) and I will keep record of the splitting to know how to correct the double counts later on. The code for the counts with comments is as follows:

import numpy as np
from math import comb
from collections import Counter

class ExpectedSize:
    def __init__(self, N):
        self.N = N
        self.counts           = None
        self.group_dict       = None
        self.corrected_counts = None
        self.expected_size    = None

    def get_T(self):
        """
        :param N: Number of nodes and edges
        :return: A vector where the ith location contains the number of graphs with i nodes and i edges where all the nodes
         are connected
        """
        T = np.zeros(self.N + 1)
        for ii in range(len(T)):
            if ii < 2:
                T[ii] = 0
            elif ii == 2:
                T[ii] = 1
            else:
                n_vec      = np.arange(1, ii-2)
                sign_vec   = (-1) ** n_vec
                choose_vec = [comb(ii, ii-1-jj) for jj in n_vec]
                t_vec      = T[ii-2:1:-1]
                power_vec  = (ii -1 - n_vec) ** (n_vec + 1)
                T[ii] = ii * T[ii - 1] * (ii - 1) + np.math.factorial(ii - 1) + np.sum(sign_vec * choose_vec * t_vec * power_vec) # X choose (X-1) = X
        return T

    def get_counts(self):
        # Init
        T_s = self.get_T()
        # counts = np.zeros((self.N + 1, self.N + 1))
        # counts[:, 0] = 1
        # counts[2, 2] = 1
        counts = {(2,2): [1]}
        for ii in range(self.N + 1):
            counts[(ii, 0)] = [1]
        group_dict = {(2,2): [[2]]}
        for NN in range(3, self.N + 1):
            for ss in range(2, NN+1):
                # Computing C(S = ss | ii) without the double-counting correction
                init_term = comb(NN, ss) * T_s[ss]
                # creating new split list
                split_list = []
                count_list = []
                for sigma in range(2, ss+1):
                    key = (sigma, NN-ss)
                    if NN-ss == 0 and sigma > 2:
                        break
                    # Getting sum components
                    temp_components = counts.get((sigma, NN-ss), [0])
                    # Updating split counts
                    group_count = group_dict.get(key, [[]])
                    # appending split list to the total list if needed
                    for gg in group_count:
                        if temp_components[0] > 0:
                            split_list.append(gg + [ss])
                            group_dict[(ss, NN)] = split_list
                            count_list += [temp_components[0]]
                counts[(ss, NN)] = [init_term * cc for cc in count_list] if len(count_list) > 0 else [0]
        self.counts     = counts
        self.group_dict = group_dict

    def correct_double_counts(self):
        counts           = self.counts
        group_dict       = self.group_dict
        corrected_counts = {}
        for key in counts:
            group_count    = group_dict.get(key, [[]])
            sum_components = counts[key]
            corrected_temp = []
            for ii, sum_component in enumerate(sum_components):
                counter_dict = Counter(group_count[ii])
                corrected_temp += [sum_component]
                for split_key in counter_dict:
                    corrected_temp[-1] = corrected_temp[-1]/ np.math.factorial(counter_dict[split_key])
            corrected_counts[key] = sum(corrected_temp)
        self.corrected_counts = corrected_counts

    def get_expected_size(self):
        cumsum   = 0
        expected = 0
        for ii in range(1, self.N+1):
            key = (ii, self.N)
            cumsum += self.corrected_counts[key]
            expected += ii * self.corrected_counts[key]
        self.expected_size = expected / cumsum

Running the following code for N=9 results in:

exp_size = ExpectedSize(9)
exp_size.get_counts()
exp_size.correct_double_counts()
print(exp_size.corrected_counts)
(2, 2): 1.0,
(2, 3): 0, (3, 3): 8.0,
(2, 4): 3.0, (3, 4): 0, (4, 4): 78.0,
(2, 5): 0, (3, 5): 80.0, (4, 5): 0, (5, 5): 944.0,
(2, 6): 15.0, (3, 6): 640.0, (4, 6): 1170.0, (5, 6): 0, (6, 6): 13800.0,
(2, 7): 0, (3, 7): 840.0, (4, 7): 21840.0, (5, 7): 19824.0, (6, 7): 0, (7, 7): 237432.0,
(2, 8): 105.0, (3, 8): 17920.0, (4, 8): 229320.0, (5, 8): 422912.0, (6, 8): 386400.0, (7, 8): 0, (8, 8): 4708144.0,
(2, 9): 0, (3, 9): 153440.0, (4, 9): 786240.0, (5, 9): 9634464.0, (6, 9): 9273600.0, (7, 9): 8547552.0, (8, 9): 0, (9, 9): 105822432.0}

We can see that the counts match the brute force computations of OP (the keys are (S,N).

Counts to expected value

Now the expectation computation can be done with ease using the get_expected_size function

Examples

  1. N=2 --> E[S|N] = 2
  2. N=3 --> E[S|N] = 3
  3. N=4 --> E[S|N] = 3.925925925925926 ; 126 / 27 = 3.925925926
  4. N=5 --> E[S|N] = 4.84375 ; 155 / 32 = 4.84375
  5. N=6 --> E[S|N] = 5.72352 ; 17886 / 3125 = 5.72352
  6. N=7 --> E[S|N] = 6.612311385459534 ; 38563 / 5832 = 6.612311385
  7. N=8 --> E[S|N] = 7.471092584115219 ; 6152766 / 823543 = 7.471092584
  8. N=9 --> E[S|N] = 8.342072010040283 ; 17494593 / 2097152 = 8.34207201
  9. N=10 --> E[S|N] = 9.189007166835722
  10. N=11 --> E[S|N] = 10.047272752
  11. N=12 --> E[S|N] = 10.891595346262637

And we can see that the overall results are accurate w.r.t the brute force results up to precision point of Python. I this is a sufficient solution.

Tomer Geva
  • 1,764
  • 4
  • 16
1

I'm going to start with some assumptions regarding the problem:

  • Meaning of connected component: consider the transformation of the graph where each edge becomes undirected. Two vertices are considered connected if there is a path between them in the transformed (undirected) graph.
  • If there is a (directed) edge (i,j) linking two vertices i, j, there can still be another (directed) edge (j,i) starting from j.

C(k,n) will denote the binomial coefficient "n choose k".

Analysis

Let Gi(N) be a subset of i vertices of the graph G(N) (there are C(i,N) such subsets). Gi(N) can define a connected component of size i only if the two following conditions are true:

  • The edges starting from the vertices of Gi(N) end in an element of Gi(N): event einGi
  • The edges starting from the vertices of G \ Gi(N) end in G \ Gi(N): event eoutGi

These two events (einGi and eoutGi) are independent because the edges are chosen independently during the graph construction.

Furthermore, there can be at most one cycle (i.e edges with the shape (i,j),(j,k), ..., (l,m), (m,i)) in Gi. More than one cycle would lead to more than one connected component in the i vertices we consider, since there is only one edge starting from a given vertex. This defines the event e2- cycles.

Thus, the overall probability of having a connected component of size i defined by the subset Gi (event ei-cc) is:

p(ei-cc) = p(einGi).p(e2- cycles|einGi). p(eoutGi)

To have a connected component of size at least i is the same as having a connected component of size i, except this time we remove the constraints imposed by event eoutGi. This defines event eleast,i


Computing the probabilities

The probability to have at least two cycles in an isolated set of K vertices is:

2cycles

Proof: To define two cycles, we need two set of vertices containing at least two vertices each. Let k1 and k2 be the number of vertices in these sets.

Defining a cycle on a set of vertices is equivalent to choosing a circular order and link the vertices. Given a regular order for set 1 (resp. 2) defining a circular order, there are k1 (resp. k2) other orders that leads to the same circular order (they belong to the same equivalence class). There are thus (k1-1)! (resp. k2-1)!) circular orders for set 1 (resp.2). This statement explains the (X-1)! bit of the formula.

We take i elements among the K we consider (i > 3). We are interested in building two cycles with these i elements (regardless of the remaining K - i elements). Since we ignore the remaining part, we must use a balance factor to compensate for the fact that we are going to count several times the same cycles (hence the 1 + C(i, K-i) bit). We split these i elements in two sets (hence the sum j=2..i/2). We take j elements among the i. Here, there are no leftover element in the i elements. However, in the case where j = i-j, we will count exactly each pair of cycles twice, thus we balance using the Kronecker symbol (delta(i,j) = 1 iif (i = j))

We have our two sets, now we count the number of cycles ((k1-1)!.k1-1)! with k1 = j and k2 = i-j) and divide them by the number of possible ways the edges can be built on the i vertices ((K-1)**i)

Note: A peer review of this probability would be appreciated.


We can now compute the probability of having a connected component of cardinality K in such a graph of N vertices and edges:

pcceq

Note: The first term denotes the choice of the K vertices. The odd 1 + C(...) part is a balance factor to account for the fact that given a K-combination X, X will appear on the "N-K" side when considering some of the others K-combinations.

The two following terms represent the mutual isolation between the K chosen vertices and the remaining N-K vertices of the graph.


Similarly, the probability to have a connected component of size at least K is the same as having a connected component of size K except that we remove the term corresponding to the event eoutGi described before:

pccsup


So far so good. The thing is... There may be more than one connected component in the graph! So we cannot simply average the formula above...

The probability of having a maximum connected component of size i can be expressed using the pk(N) probabilities: it's the conjunction of the following events:

  • Having a connected component of cardinality i
  • Not having a connected component of cardinality at least i+1 in the N-i remaining vertices.

These two events are independent because the connected component of cardinality i is disjoint from the rest of the graph.

Thus, the probability that the maximum connected component is of size K is:

pmax(K,N) = pCCeq(K,N). (1 - pCCsup(K+1,N-K))

All that's left to do is to average this probability. The expected size S(N) of the largest connected component in such a random graph with N vertices and edges is:

S(N) = Σ2≤K≤N K.pmax(K,N)


Quick example

Let N = 5. We have:

  • pmax(1,5) = 0
  • pmax(2,5) = 0
  • pmax(3,5) = C(3,5).(1/2)3.(1/4)2 = 10/(2.43) = 80/(45)
  • pmax(4,5) = 0
  • pmax(5,5) = (1 - p2cycles(5)) = 1 - 80/(45)

Thus, the pmax(k,5) are always 0, except for k = 3 and k = 5. The cycle formula gives 20 couples of 2-cycles/3-cycles over 45 possible arrangements and 15 couples of 2-cycles/2-cycles over 44 possible arrangements. The sum is thus 80/(45), which is exactly pmax(3,5).


I guess you can now use a computer to get the real values from that formula? :)

Disclaimer: I checked the 2-cycles formula for N=3,4,5 by hand and it seems to hold well. However, an external review would be very helpful. Note that these computations have been made under the assumptions found at the beginning of my answer.

Again, thanks to @DavidEisenstat for his useful comment which led to this more rigorous (and hopefully correct) answer.

Community
  • 1
  • 1
Rerito
  • 5,886
  • 21
  • 47
  • Judging from the example "((1, 2), (2, 1), (3, 1), (4, 1))" with a connected component size of 4, I don't think the question is about strongly connected components. – Peter de Rivaz Feb 15 '15 at 17:15
  • The way the question defined "connected component" is not necessarily a strongly connected component (e.g. it would consider 1->2->3 as a connected component). But that may be an error in the formulation of the question. – interjay Feb 15 '15 at 17:16
  • He stated `By connected component, this is defined as the largest cluster/group of nodes where each node in the connected component can reach to/is reachable from every other node in the same connected component`. So to me each node in the connected component must reach every other node in the component... How is that different from a strongly connected component? – Rerito Feb 15 '15 at 17:17
  • `can reach to/is reachable from` implies that either one is fine. A strongly connected component requires both. But now I see that the question also says "Assume the edges are undirected/bidirectional" so I guess OP is confused, and it's hard to know what the original intent of the interview question was. – interjay Feb 15 '15 at 17:19
  • @interjay take (1->2->3), 2 can neither reach any node (can't reach 1), nor is reachable by any node in the component (can't be reached by 3)... – Rerito Feb 15 '15 at 17:20
  • 2 can reach 3, and is reachable from 1. – interjay Feb 15 '15 at 17:20
  • hopefully i'm right when i say that i'm not referring to *strongly* connected components. for the given example quoted in the comments `((1, 2), (2, 1), (3, 1), (4, 1))`, it means that there are edges 1-2, 2-1, 3-1, 4-1; and if a node is linked to another, it is linked to that same cluster/component that the other node belongs to (if there is one), otherwise form one with these 2 nodes, maybe can look at the bfs method in my linked source code that generated the sample output; so anyway, in this case, the entire 4 nodes are all considered as one single cluster/component/group – Adeline Ho Feb 15 '15 at 17:28
  • @interjay Guess I misunterstood the statement – Rerito Feb 15 '15 at 17:28
  • Is this explanation sufficient? when there is an edge node a-node b, node a and b belong to the same cluster/component/group, and if node b or node a is linked to a c, then all 3: node a,b,c belong to the same cluster/component/group, is my intent. – Adeline Ho Feb 15 '15 at 17:31
  • Yea, as shown in my example input for `N = 4`, there can be 2 components for certain edgelists, eg. {(node1-node2) (node3-node4)}; as an edgelist: `{(1,2),(2,1),(3,4),(4,3)}` – Adeline Ho Feb 15 '15 at 17:33
  • 2
    Multiplying probabilities like that without an argument as to why the corresponding events are independent is usually a bad sign. – David Eisenstat Feb 16 '15 at 14:22
  • Your assumptions should be correct (in the sense that it's aligned with what I'm trying to convey) – Adeline Ho Feb 16 '15 at 15:54
0

This is not a solution yet but some observations.

Firstly with n edges and n vertices, and each vertex is connected to some other vertex except itself, any strongly connected component of size k in this graph must have exactly k edges. This follows from

  • any SCC of size k must have at least k-1 edges. In this graph it's not possible for any SCC to have exactly k-1 edges because if that's the case then there must be a vertex in the SCC that's connected to another vertex that's not in the SCC, hence invalidating the assumption that this SCC is of size k.
  • It is also not possible in this graph that any SCC have more than k+1 edges, because each vertex is connected to exactly one other vertex.

Now suppose we want to count the total number valid graphs constructed in this question, and dp[n][k] is the total number of valid graphs with n vertices and maximum SCC of size k. Then dp[n][k] is

  • Somehow choose k vertices from n vertices and form an SCC, and
  • The remaining n-k vertices must not have an SCC of size larger than k.

Therefore the DP formulation becomes:

dp[n][k]=C(n,k)*dp[n-k][k] + 
         C(n,k)*dp[n-k][k-1] +
         C(n,k)*dp[n-k][k-2] + 
         ...
         C(n,k)*dp[n-k][2] +
         C(n,k)*dp[n-k][1]

where C(n,k) is n!/[(n-k)!k!]

(I am not sure if this is right, but hopefully on the right direction)

The total number of valid graphs of n vertices is (n-1)**(n-1). The expected size of the largest SCC can be worked out from the basic principle of expectation.

Jason L
  • 510
  • 5
  • 16