Google interview algorithm puzzle: expected size of the largest connected component in a random simple graph (N nodes, N edges)?

Question

Given a random simple graph of N nodes, N edges, and a uniform edge probability p, what is the expected size of the largest connected component?

The only initial input parameter supplied is N, which represents the number of nodes in the graph
The graph is simple, meaning that it is not allowed to contain any self-loops
There are exactly N node pairings, ie. the edgelist is of the form {(1,a), (2,b), (3,c), ..., (N,[next available letter of the alphabet])}. The conditions are that a!=1, b!=2, c!=3 etc. but other than that, a,b,c,... can take any value from 1 to N inclusive
The node n forming an edge with some other node can be described by a uniform probability distribution
Foreach such valid graph constructed of N nodes and N edges, find the size of the largest connected component S
By connected component, this is defined as the largest cluster/group of nodes where each node in the connected component can reach to/is reachable from every other node in the same connected component

The question is: among all such valid graphs constructed, what is the expected value of S?

I would like to know of an efficient way to think about/analyze/solve this problem that can deal with N ranging from 2 to 50 inclusive. Some code to illustrate would be helpful to my understanding.

I wrote the following source codes to naively bruteforce the possibilities, hoping to find a pattern from there. Managed to get up to N = 9 with these:

Python
C/C++

Both attempt to generate all possible permutations according to the criteria specified in the problem, and then calculate S for each graph generated via BFS.

Output so far

As for the format, eg. for the line "N=4: {2:3,4:78} 106/27", this means that there are 3 graphs with the size of the largest component = 2, and 78 graphs with the size of the largest component = 4, so the final answer = (3/(78+3))*2 + (78/(78+3))*4 = 106/27.

N=2: {2:1} 2/1
N=3: {3:8} 3/1
N=4: {2:3,4:78} 106/27
N=5: {3:80,5:944} 155/32
N=6: {2:15,3:640,4:1170,6:13800} 17886/3125
N=7: {3:840,4:21840,5:19824,7:237432} 38563/5832
N=8: {2:105,3:17920,4:229320,5:422912,6:386400,8:4708144} 6152766/823543
N=9: {3:153440,4:786240,5:9634464,6:9273600,7:8547552,9:105822432} 17494593/2097152

edit: Just discovered this OEIS sequence #A000435 gives the answers (formula/listing up to N=18) for the number of graphs with N nodes and largest size component = N. This is exactly coincidental with my bruteforce output, eg. for N=9, there are 105822432 graphs with the largest size of connected component = 9. Not sure if this helps - one of the formulas given there to derive this for larger N is a(n) = (n-1)! * Sum (n^k/k!, k=0..n-2) Python implementation

Example for N = 4:

There are a total of 81 possible edge assignments for 4 nodes (1-based indexed from 1 to N).

The format for the sample output below is: ((1, 2), (2, 1), (3, 1), (4, 1)): 4 means that there are edges between nodes 1 and 2, nodes 2 and 1, nodes 3 and 1, and nodes 4 and 1. Assume the edges are undirected/bidirectional. For such a graph, there is only a single connected component of all 4 nodes {1,2,3,4}, so the size of the largest (only) connected component is 4.

After generating this list, expected value of S can be computed via (fraction of all 81 instances whereS== 4) * 4 + (fraction of all 81 instances whereS== 2) * 2 =106/27 - since the only values S takes are 2,4.

{((1, 2), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 1), (3, 4), (4, 3)): 2,
 ((1, 2), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 2), (2, 4), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 1), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 1), (4, 2)): 2,
 ((1, 3), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 3), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 3), (2, 4), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 1), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 2), (4, 1)): 2,
 ((1, 4), (2, 3), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 3), (3, 4), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 1), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 2), (4, 3)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 1)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 2)): 4,
 ((1, 4), (2, 4), (3, 4), (4, 3)): 4}

You say *must be joined to exactly 1 other node*. I don't get it. That makes N/2 components, no? — aioobe, Feb 15 '15 at 15:55
I meant there must be exactly 1 unique pairing for every node, that cannot be itself, ie. if there are 3 nodes, i must have an edgelist of the form {(1, a) , (2, b), (3, c)}, where a!=1, b!=2, c!=3, and a,b,c can take any other value from 1 to **N** inclusive. — Adeline Ho, Feb 15 '15 at 16:01
also: edited the question to try to clarify this. thanks & sorry for the confusion - it is indeed possible, as you've tried to point out, that one node can be connected to multiple other nodes (or not); it's just a restriction on the edgelist — Adeline Ho, Feb 15 '15 at 16:09
In what sense is this a bipartite graph? (Normally there would two different sets of vertices A/B and edges only between vertices in different sets if this is a bipartite graph) — Peter de Rivaz, Feb 15 '15 at 16:11
Exactly. That makes the whole self-reference constraint redundant. Something is messed up with this problem formulation. — aioobe, Feb 15 '15 at 16:12
oh, i think it must be just my unclear understanding of the terms. let me remove `bipartite` then. thought the example would help there — Adeline Ho, Feb 15 '15 at 16:18
@AdelineHo, you probably want to clarify that the graph is *directed* right in the beginning. — aioobe, Feb 15 '15 at 16:24
ok, done. thanks for the guidance. does the formulation of the question make sense now? — Adeline Ho, Feb 15 '15 at 16:31
First you say that the graph is directed, but then you say "Assume the edges are undirected/bidirectional". Which is it? — interjay, Feb 15 '15 at 17:17
What is the meaning of connected: weakly connected, connected or strongly connected (if the question is from a Google interview then this must have been specified) — Rerito, Feb 15 '15 at 18:57
The "directed" was added there by suggestion - see earlier comment; but then later I realised that that's not what I mean, so I've already removed it now. — Adeline Ho, Feb 16 '15 at 01:40
Well, connected means: Two nodes A and B belong to the same cluster/group if they share an edge ie. A-B exists in the edgelist, or if A shares an edge with a node of the same cluster/group as B. — Adeline Ho, Feb 16 '15 at 01:42
For the process you describe, among the N edges, an edge (i, j) can occur twice, right? i can choose j, and j can choose i? — Neal Young, Apr 01 '18 at 18:39

Tomer Geva · Answer 1 · 2022-04-09T18:14:22.507

Before looking at the probabilities, let us look at a similar problem in combinatorics, i.e. given the rules of the graphs and N nodes, how many graphs can be attained when the largest connected component is S.

Notation

For the rest on this answer I will use the following notation:

C(S=s|N) notes the total count of possible graphs with N nodes and N edges as defined in the question with the largest connects component is S=s.

Examples:

C(S=2|2) = 1
C(S=2|3) = 0
C(S=3|3) = 8

T(s) notes the number of graphs with s nodes and s edges as defined in the question where all nodes are connected. T(s) = C(S=s|s).

Examples

T(2) = 1
T(3) = 8

Key observations

Some key observation to lay out the stages of the solution:

C(S=N-1|N) = 0 since the last node must be connected to itself. This is however not a legal graph and therefore the count for this case is 0
C(S=2|N) is different than 0 only for even values of N. If N is odd, there has to be a node connected to itself and again, this is illegal.
C(S=2|N) = (N choose 2) * C(S=2|N-2). This also correlates with 2. Since this recursion will eventually reach C(S=2|1) =0 and the entire expression will be 0 for odd N values.
Expanding 3. to a slightly more general case we can get C(S=2|N) = (N choose 2) * T(2) * C(S=2|N-2) since T(2) = 1 this holds.
Expanding 4. for a general case we get C(S=s|N) = (N choose s) * T(s) * [C(S=2|N-s) + C(S=3|N-s) + ... + C(S=s|N-s)]

Note that these expressions include double counts, we will get to that later.

This means that we found a recursive pattern to compute the number of graph (up to double counts), all we need is a method to compute T(s).

Computing T(s)

For a N nodes N edges graph as defined in the question, T(N) can be computed via three components:

(N choose N-1) * T(N-1) * (N-1): The fully connected structure has a N-1 connected graph and the new node is connected to either of the N-1 nodes creating a N connected graph. This section contains double counts.
Number of graphs creating a cycle the size of N which is (N-1)!
correcting double counts

Therefore we can compute T(s) as follows (the explanation for the double counts in this case is too complex for me to write down an explanation):

We note that for T(2) and T(3) we do not have double counts.

Examples

T(3) = (3 choose 2)*T(2)*2 + 2! = 3*1*2+2=8
T(4) = (4 choose 3)*T(3)*3 + 3! - (4 choose 2)*T(2)*2^2 = 4×8×3 +6 -6*1*4=78
T(5) = (5 choose 4)*T(4)*4 + 4! - (5 choose 3)*T(3)*3^2 + (5 choose 2)*T(2)*2^3 = 5*78*4 +24 -10*8*9 +10*1*2^3 = 944

The only thing left is to find a way to overcome double counts in key observation 5. This is done using the link above. After breaking N to a different groups, we correct the double count by dividing the result by a factor to complete it.

Examples

In case the have N=6 nodes and we compute

C(S=2|6) = (6 choose 2)*T(2)*C(S=2|4) = (6 choose 2)*T(2)*(4 choose 2)*T(2)

We can see that we divided 6 into three groups of 2 therefore we need to correct the double count by applying a factor of 3!:

C(S=2|6) = (6 choose 2)*T(2)*(4 choose 2)*T(2)/(3!) = 15

In case the have N=9 nodes and we compute

C(S=3|9) = (9 choose 3)*T(3)*[C(S=2|6) + C(S=3|6)]

We can see that the first component of the brackets, i.e. C(S=2|6) was computed in example 1, and a factor of 3! was divided from the result because 6 was divided into three groups of 2. for the second component of the brackets we get:

C(S=3|9) = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*C(S=3|3)] = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*T(3)]

We see that 9 was divided into three groups of 3 and therefore we add a factor of 3! to counter the double counts:

C(S=3|9) = (9 choose 3)*T(3)*[15 + (6 choose 3)*T(3)*T(3)/(3!)] = 153440

Everything is now set up to compute C(S=s|N) for all s valuees between 2 and N. Then the probability can achieved by dividing by the sum of all graphs:

P(S=s|N) = C(S=s|N) / sum_s{C(S=s|N)}

And the expected value can be found via:

E[S|N] = sum_s{s * P(S=s|N)}

Implementation

Below I will show a python implementation for the method derived above. I will compute the values including the double counts to keep the recursive property, thus allowing me to compute all the graph counts in O(N^2) and I will keep record of the splitting to know how to correct the double counts later on. The code for the counts with comments is as follows:

import numpy as np
from math import comb
from collections import Counter

class ExpectedSize:
    def __init__(self, N):
        self.N = N
        self.counts           = None
        self.group_dict       = None
        self.corrected_counts = None
        self.expected_size    = None

    def get_T(self):
        """
        :param N: Number of nodes and edges
        :return: A vector where the ith location contains the number of graphs with i nodes and i edges where all the nodes
         are connected
        """
        T = np.zeros(self.N + 1)
        for ii in range(len(T)):
            if ii < 2:
                T[ii] = 0
            elif ii == 2:
                T[ii] = 1
            else:
                n_vec      = np.arange(1, ii-2)
                sign_vec   = (-1) ** n_vec
                choose_vec = [comb(ii, ii-1-jj) for jj in n_vec]
                t_vec      = T[ii-2:1:-1]
                power_vec  = (ii -1 - n_vec) ** (n_vec + 1)
                T[ii] = ii * T[ii - 1] * (ii - 1) + np.math.factorial(ii - 1) + np.sum(sign_vec * choose_vec * t_vec * power_vec) # X choose (X-1) = X
        return T

    def get_counts(self):
        # Init
        T_s = self.get_T()
        # counts = np.zeros((self.N + 1, self.N + 1))
        # counts[:, 0] = 1
        # counts[2, 2] = 1
        counts = {(2,2): [1]}
        for ii in range(self.N + 1):
            counts[(ii, 0)] = [1]
        group_dict = {(2,2): [[2]]}
        for NN in range(3, self.N + 1):
            for ss in range(2, NN+1):
                # Computing C(S = ss | ii) without the double-counting correction
                init_term = comb(NN, ss) * T_s[ss]
                # creating new split list
                split_list = []
                count_list = []
                for sigma in range(2, ss+1):
                    key = (sigma, NN-ss)
                    if NN-ss == 0 and sigma > 2:
                        break
                    # Getting sum components
                    temp_components = counts.get((sigma, NN-ss), [0])
                    # Updating split counts
                    group_count = group_dict.get(key, [[]])
                    # appending split list to the total list if needed
                    for gg in group_count:
                        if temp_components[0] > 0:
                            split_list.append(gg + [ss])
                            group_dict[(ss, NN)] = split_list
                            count_list += [temp_components[0]]
                counts[(ss, NN)] = [init_term * cc for cc in count_list] if len(count_list) > 0 else [0]
        self.counts     = counts
        self.group_dict = group_dict

    def correct_double_counts(self):
        counts           = self.counts
        group_dict       = self.group_dict
        corrected_counts = {}
        for key in counts:
            group_count    = group_dict.get(key, [[]])
            sum_components = counts[key]
            corrected_temp = []
            for ii, sum_component in enumerate(sum_components):
                counter_dict = Counter(group_count[ii])
                corrected_temp += [sum_component]
                for split_key in counter_dict:
                    corrected_temp[-1] = corrected_temp[-1]/ np.math.factorial(counter_dict[split_key])
            corrected_counts[key] = sum(corrected_temp)
        self.corrected_counts = corrected_counts

    def get_expected_size(self):
        cumsum   = 0
        expected = 0
        for ii in range(1, self.N+1):
            key = (ii, self.N)
            cumsum += self.corrected_counts[key]
            expected += ii * self.corrected_counts[key]
        self.expected_size = expected / cumsum

Running the following code for N=9 results in:

exp_size = ExpectedSize(9)
exp_size.get_counts()
exp_size.correct_double_counts()
print(exp_size.corrected_counts)

(2, 2): 1.0,
(2, 3): 0, (3, 3): 8.0,
(2, 4): 3.0, (3, 4): 0, (4, 4): 78.0,
(2, 5): 0, (3, 5): 80.0, (4, 5): 0, (5, 5): 944.0,
(2, 6): 15.0, (3, 6): 640.0, (4, 6): 1170.0, (5, 6): 0, (6, 6): 13800.0,
(2, 7): 0, (3, 7): 840.0, (4, 7): 21840.0, (5, 7): 19824.0, (6, 7): 0, (7, 7): 237432.0,
(2, 8): 105.0, (3, 8): 17920.0, (4, 8): 229320.0, (5, 8): 422912.0, (6, 8): 386400.0, (7, 8): 0, (8, 8): 4708144.0,
(2, 9): 0, (3, 9): 153440.0, (4, 9): 786240.0, (5, 9): 9634464.0, (6, 9): 9273600.0, (7, 9): 8547552.0, (8, 9): 0, (9, 9): 105822432.0}

We can see that the counts match the brute force computations of OP (the keys are (S,N).

Counts to expected value

Now the expectation computation can be done with ease using the get_expected_size function

Examples

N=2 --> E[S|N] = 2
N=3 --> E[S|N] = 3
N=4 --> E[S|N] = 3.925925925925926 ; 126 / 27 = 3.925925926
N=5 --> E[S|N] = 4.84375 ; 155 / 32 = 4.84375
N=6 --> E[S|N] = 5.72352 ; 17886 / 3125 = 5.72352
N=7 --> E[S|N] = 6.612311385459534 ; 38563 / 5832 = 6.612311385
N=8 --> E[S|N] = 7.471092584115219 ; 6152766 / 823543 = 7.471092584
N=9 --> E[S|N] = 8.342072010040283 ; 17494593 / 2097152 = 8.34207201
N=10 --> E[S|N] = 9.189007166835722
N=11 --> E[S|N] = 10.047272752
N=12 --> E[S|N] = 10.891595346262637

And we can see that the overall results are accurate w.r.t the brute force results up to precision point of Python. I this is a sufficient solution.

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

I'm going to start with some assumptions regarding the problem:

Meaning of connected component: consider the transformation of the graph where each edge becomes undirected. Two vertices are considered connected if there is a path between them in the transformed (undirected) graph.
If there is a (directed) edge (i,j) linking two vertices i, j, there can still be another (directed) edge (j,i) starting from j.

C(k,n) will denote the binomial coefficient "n choose k".

Analysis

Let G_i(N) be a subset of i vertices of the graph G(N) (there are C(i,N) such subsets). G_i(N) can define a connected component of size i only if the two following conditions are true:

The edges starting from the vertices of G_i(N) end in an element of G_i(N): event e_inGi
The edges starting from the vertices of G \ G_i(N) end in G \ G_i(N): event e_outGi

These two events (e_inGi and e_outGi) are independent because the edges are chosen independently during the graph construction.

Furthermore, there can be at most one cycle (i.e edges with the shape (i,j),(j,k), ..., (l,m), (m,i)) in G_i. More than one cycle would lead to more than one connected component in the i vertices we consider, since there is only one edge starting from a given vertex. This defines the event e_{2- cycles}.

Thus, the overall probability of having a connected component of size i defined by the subset G_i (event e_i-cc) is:

p(e_i-cc) = p(e_inGi).p(e_{2- cycles}|e_inGi). p(e_outGi)

To have a connected component of size at least i is the same as having a connected component of size i, except this time we remove the constraints imposed by event e_outGi. This defines event e_least,i

Computing the probabilities

The probability to have at least two cycles in an isolated set of K vertices is:

Proof: To define two cycles, we need two set of vertices containing at least two vertices each. Let k₁ and k₂ be the number of vertices in these sets.

Defining a cycle on a set of vertices is equivalent to choosing a circular order and link the vertices. Given a regular order for set 1 (resp. 2) defining a circular order, there are k₁ (resp. k₂) other orders that leads to the same circular order (they belong to the same equivalence class). There are thus (k₁-1)! (resp. k₂-1)!) circular orders for set 1 (resp.2). This statement explains the (X-1)! bit of the formula.

We take i elements among the K we consider (i > 3). We are interested in building two cycles with these i elements (regardless of the remaining K - i elements). Since we ignore the remaining part, we must use a balance factor to compensate for the fact that we are going to count several times the same cycles (hence the 1 + C(i, K-i) bit). We split these i elements in two sets (hence the sum j=2..i/2). We take j elements among the i. Here, there are no leftover element in the i elements. However, in the case where j = i-j, we will count exactly each pair of cycles twice, thus we balance using the Kronecker symbol (delta(i,j) = 1 iif (i = j))

We have our two sets, now we count the number of cycles ((k₁-1)!.k₁-1)! with k₁ = j and k₂ = i-j) and divide them by the number of possible ways the edges can be built on the i vertices ((K-1)**i)

Note: A peer review of this probability would be appreciated.

We can now compute the probability of having a connected component of cardinality K in such a graph of N vertices and edges:

Note: The first term denotes the choice of the K vertices. The odd 1 + C(...) part is a balance factor to account for the fact that given a K-combination X, X will appear on the "N-K" side when considering some of the others K-combinations.

The two following terms represent the mutual isolation between the K chosen vertices and the remaining N-K vertices of the graph.

Similarly, the probability to have a connected component of size at least K is the same as having a connected component of size K except that we remove the term corresponding to the event e_outGi described before:

So far so good. The thing is... There may be more than one connected component in the graph! So we cannot simply average the formula above...

The probability of having a maximum connected component of size i can be expressed using the p_k(N) probabilities: it's the conjunction of the following events:

Having a connected component of cardinality i
Not having a connected component of cardinality at least i+1 in the N-i remaining vertices.

These two events are independent because the connected component of cardinality i is disjoint from the rest of the graph.

Thus, the probability that the maximum connected component is of size K is:

p_max(K,N) = p_CCeq(K,N). (1 - p_CCsup(K+1,N-K))

All that's left to do is to average this probability. The expected size S(N) of the largest connected component in such a random graph with N vertices and edges is:

S(N) = Σ_2≤K≤N K.p_max(K,N)

Quick example

Let N = 5. We have:

p_max(1,5) = 0
p_max(2,5) = 0
p_max(3,5) = C(3,5).(1/2)³.(1/4)² = 10/(2.4³) = 80/(4⁵)
p_max(4,5) = 0
p_max(5,5) = (1 - p_2cycles(5)) = 1 - 80/(4⁵)

Thus, the p_max(k,5) are always 0, except for k = 3 and k = 5. The cycle formula gives 20 couples of 2-cycles/3-cycles over 4⁵ possible arrangements and 15 couples of 2-cycles/2-cycles over 4⁴ possible arrangements. The sum is thus 80/(4⁵), which is exactly p_max(3,5).

I guess you can now use a computer to get the real values from that formula? :)

Disclaimer: I checked the 2-cycles formula for N=3,4,5 by hand and it seems to hold well. However, an external review would be very helpful. Note that these computations have been made under the assumptions found at the beginning of my answer.

Again, thanks to @DavidEisenstat for his useful comment which led to this more rigorous (and hopefully correct) answer.

Judging from the example "((1, 2), (2, 1), (3, 1), (4, 1))" with a connected component size of 4, I don't think the question is about strongly connected components. — Peter de Rivaz, Feb 15 '15 at 17:15
The way the question defined "connected component" is not necessarily a strongly connected component (e.g. it would consider 1->2->3 as a connected component). But that may be an error in the formulation of the question. — interjay, Feb 15 '15 at 17:16
He stated `By connected component, this is defined as the largest cluster/group of nodes where each node in the connected component can reach to/is reachable from every other node in the same connected component`. So to me each node in the connected component must reach every other node in the component... How is that different from a strongly connected component? — Rerito, Feb 15 '15 at 17:17
`can reach to/is reachable from` implies that either one is fine. A strongly connected component requires both. But now I see that the question also says "Assume the edges are undirected/bidirectional" so I guess OP is confused, and it's hard to know what the original intent of the interview question was. — interjay, Feb 15 '15 at 17:19
@interjay take (1->2->3), 2 can neither reach any node (can't reach 1), nor is reachable by any node in the component (can't be reached by 3)... — Rerito, Feb 15 '15 at 17:20
hopefully i'm right when i say that i'm not referring to *strongly* connected components. for the given example quoted in the comments `((1, 2), (2, 1), (3, 1), (4, 1))`, it means that there are edges 1-2, 2-1, 3-1, 4-1; and if a node is linked to another, it is linked to that same cluster/component that the other node belongs to (if there is one), otherwise form one with these 2 nodes, maybe can look at the bfs method in my linked source code that generated the sample output; so anyway, in this case, the entire 4 nodes are all considered as one single cluster/component/group — Adeline Ho, Feb 15 '15 at 17:28
Is this explanation sufficient? when there is an edge node a-node b, node a and b belong to the same cluster/component/group, and if node b or node a is linked to a c, then all 3: node a,b,c belong to the same cluster/component/group, is my intent. — Adeline Ho, Feb 15 '15 at 17:31
Yea, as shown in my example input for `N = 4`, there can be 2 components for certain edgelists, eg. {(node1-node2) (node3-node4)}; as an edgelist: `{(1,2),(2,1),(3,4),(4,3)}` — Adeline Ho, Feb 15 '15 at 17:33
Multiplying probabilities like that without an argument as to why the corresponding events are independent is usually a bad sign. — David Eisenstat, Feb 16 '15 at 14:22
Your assumptions should be correct (in the sense that it's aligned with what I'm trying to convey) — Adeline Ho, Feb 16 '15 at 15:54

Jason L · Answer 3 · 2015-02-17T06:34:44.133

This is not a solution yet but some observations.

Firstly with n edges and n vertices, and each vertex is connected to some other vertex except itself, any strongly connected component of size k in this graph must have exactly k edges. This follows from

any SCC of size k must have at least k-1 edges. In this graph it's not possible for any SCC to have exactly k-1 edges because if that's the case then there must be a vertex in the SCC that's connected to another vertex that's not in the SCC, hence invalidating the assumption that this SCC is of size k.
It is also not possible in this graph that any SCC have more than k+1 edges, because each vertex is connected to exactly one other vertex.

Now suppose we want to count the total number valid graphs constructed in this question, and dp[n][k] is the total number of valid graphs with n vertices and maximum SCC of size k. Then dp[n][k] is

Somehow choose k vertices from n vertices and form an SCC, and
The remaining n-k vertices must not have an SCC of size larger than k.

Therefore the DP formulation becomes:

dp[n][k]=C(n,k)*dp[n-k][k] + 
         C(n,k)*dp[n-k][k-1] +
         C(n,k)*dp[n-k][k-2] + 
         ...
         C(n,k)*dp[n-k][2] +
         C(n,k)*dp[n-k][1]

where C(n,k) is n!/[(n-k)!k!]

(I am not sure if this is right, but hopefully on the right direction)

The total number of valid graphs of n vertices is (n-1)**(n-1). The expected size of the largest SCC can be worked out from the basic principle of expectation.

Google interview algorithm puzzle: expected size of the largest connected component in a random simple graph (N nodes, N edges)?

3 Answers3

Notation

Key observations

Computing T(s)

Implementation

Counts to expected value

Analysis

Computing the probabilities

Quick example