Update 3 add pivoting
This is a suggested optimisation to the module below, which should reduce the number of recursions significantly. Add it to the end of the module code, and replace the loop for my $v ( keys %$p )
in _bron_kerbosh
with for my $v ( _choose_pivot($p, $x) )
# Find an element u of P U X such that as many as possible of its
# neighbours fall in P
#
sub _choose_pivot {
my ( $p, $x ) = @_;
my @p = keys %$p;
my @choice = @p;
for my $u ( @p, keys %$x ) {
my $nu = $neighbours{$u};
my %nu = map +( $_ => 1 ), @$nu;
my @subset = grep { not $nu{$_} } @p;
@choice = @subset if @subset < @choice;
}
@choice;
}
Update 2 with module
Wikipedia describes the Bron-Kerbosch algorithm for finding maximal cliques in a graph. It also says
Although other algorithms for solving the clique problem have running times that are, in theory, better on inputs that have few maximal independent sets, the Bron–Kerbosch algorithm and subsequent improvements to it are frequently reported as being more efficient in practice than the alternatives.
So since CPAN appears to have no clique module that I can find I thought it would be useful to implement it. This is the code. You should copy and save it as Graph/Cliques/Bron_Kerbosch.pm
. I shall prepare some tests and put it on CPAN shortly
package Graph::Cliques::Bron_Kerbosch;
use strict;
use warnings;
use v5.8.3;
use Exporter qw/ import /;
our @EXPORT_OK = qw/ get_cliques /;
my ( %neighbours, @cliques );
sub get_cliques {
my ( $edges ) = @_;
%neighbours = ();
@cliques = ();
for my $edge ( @$edges ) {
my ( $n1, $n2 ) = @$edge;
$neighbours{$n1}{$n2} = 1;
$neighbours{$n2}{$n1} = 1;
}
$_ = [ keys %$_ ] for values %neighbours;
my ( %r, %p, %x );
$p{$_} = 1 for map @$_, @$edges;
_bron_kerbosch( \( %r, %p, %x ) );
@cliques;
}
sub _bron_kerbosch {
my ( $r, $p, $x ) = @_;
unless ( %$p or %$x ) {
push @cliques, [ keys %$r ];
return;
}
for my $v ( keys %$p ) {
my $nv = $neighbours{$v};
my %r_ = ( %$r, $v => 1 );
my %p_ = map { $_ => 1 } _intersect( [ keys %$p ], $nv);
my %x_ = map { $_ => 1 } _intersect( [ keys %$x ], $nv);
_bron_kerbosch( \( %r_, %p_, %x_ ) );
delete $p->{$v};
$x->{$v} = 1;
}
}
sub _intersect {
my ( $aa, $ab ) = @_;
my %ab = map { $_ => 1 } @$ab;
grep $ab{$_}, @$aa;
}
1;
And this is the program that drives the module using your own data. get_cliques
executes in just under a millisecond on my system
use strict;
use warnings;
use Graph::Cliques::Bron_Kerbosch qw/ get_cliques /;
# Read the data into an array of arrays, converting from the question's R
# output. Each element of @edges contains a pair of nodes of the graph
#
my @edges;
while ( <DATA> ) {
my @pair = split;
next unless @pair > 2 and shift( @pair ) =~ /\[/;
push @edges, \@pair;
}
# Call the utility function to get a list of cliques
#
my @groups = get_cliques( \@edges );
# Extract the hash keys to change the array of hashes into an array of sorted
# arrays, then sort the array first by the size of the clique and then by the
# first value in each group
#
$_ = [ sort { $a <=> $b } @$_ ] for @groups;
@groups = sort { @$a <=> @$b or $a->[0] <=> $b->[0] } @groups;
print join( ' ', map { sprintf '%3d', $_ } @$_ ), "\n" for @groups;
__DATA__
[,1] [,2]
[1,] 6 267
[2,] 9 10
[3,] 11 12
[4,] 79 80
[5,] 96 570
[6,] 314 583
[7,] 314 584
[8,] 425 426
[9,] 427 428
[10,] 427 429
[11,] 427 430
[12,] 427 472
[13,] 427 473
[14,] 427 474
[15,] 428 430
[16,] 428 473
[17,] 429 430
[18,] 430 472
[19,] 430 473
[20,] 430 474
[21,] 472 474
[22,] 517 519
[23,] 517 520
[24,] 517 521
[25,] 519 520
[26,] 519 521
[27,] 520 521
[28,] 583 584
[29,] 649 650
output
6 267
9 10
11 12
79 80
96 570
425 426
649 650
314 583 584
427 429 430
427 428 430 473
427 430 472 474
517 519 520 521
Update 1
Okay what you have here is known mathematically as a graph, and what you are describing, where every value is connected to every other value, is called a complete graph
Knowing that lets you use Google, and there is a question "Find all complete sub-graphs within a graph" here on Stack Overflow which tells us that a complete subgraph is called a clique, which has its very own set of clique problems, of which yours is "listing all maximal cliques". Wikipedia tells us that "These problems are all hard"!
On this basis I checked CPAN for a clique module and found Graph::Clique
which I assumed I would just have to plug in to your question. However it has problems
Can't use string ("1") as a SCALAR ref while "strict refs" in use
- Because of a sorting bug, it works only with numeric node names that all have the same number of digits
It also uses a brute-force technique that employs a regex method, which while quite clever is not that fast
As it was a better place to start than nothing I fixed it and added some calling code that checks whether a smaller clique found earlier is a subset of a larger one. The result is this program that seems to do what you want
Note though, that I think your expected data is wrong, as it contains cliques that are subsets of others in your list, as I commented beneath your question. And you can't want to include all subsets, as otherwise your example would list all node pairs instead of just some of them. There are actually seven two-node cliques in your data; [517, 521] isn't one of them because it is a subset of [517, 519, 520, 521]
This program runs in just under six seconds on my system. The algorithm works by looking for cliques of successively larger sizes until none are found. By far the biggest delay here is establishing that there are no cliques with five nodes in your data, which takes around five seconds. Finding all of those with four nodes or less takes less than a second
use strict;
use warnings;
use List::MoreUtils qw/ uniq any all /;
# Read the data into an array of arrays. Each element of @edges contains a
# pair of nodes of the graph
#
my @edges;
push @edges, [ split ] while <DATA>;
# Keep asking for cliques of a larger size until we find none. Remove from
# those already found any that are subsets of new ones
#
my @groups;
for ( my $size = 2; my @cliques = get_cliques( $size, \@edges ); ++$size ) {
@cliques = map +{ map +( $_ => 1 ), split }, @cliques;
for ( my $i = 0; $i < @groups; ) {
my $group = $groups[$i];
my $subset = any {
my $clique = $_;
all { $clique->{$_} } keys %$group;
} @cliques;
if ( $subset ) {
splice @groups, $i, 1;
}
else {
++$i;
}
}
push @groups, @cliques;
}
# Extract the hash keys to change the array of hashes into an array of sorted
# arrays, then sort the array first by the size of the clique and then by the
# first value in each group
#
$_ = [ sort { $a <=> $b } keys %$_ ] for @groups;
@groups = sort { @$a <=> @$b or $a->[0] <=> $b->[0] } @groups;
print join( ' ', map { sprintf '%3d', $_ } @$_ ), "\n" for @groups;
# This subroutine is based on the non-functional `Graph::Clique` CPAN module
# by Edward Wijaya, <ewijaya@singnet.com.sg>
#
sub get_cliques {
my ( $k, $edges ) = @_;
my $string = do {
my @vertices = sort { $a <=> $b } uniq map @$_, @$edges;
my @edges = map "$_->[0]-$_->[1]", sort { $a->[0] <=> $b->[0] } @{$edges};
local $" = ','; # Fix SO syntax colouring "
"@vertices;@edges";
};
my $regex = join '[^;]+', ('\b(\d+)\b') x $k;
$regex .= '[^;]*;';
$regex .= "\n";
for my $i ( 1 .. $k-1 ) {
for my $j ( $i+1 .. $k ) {
$regex .= sprintf '(?=.*\b\g%d-\g%d\b)', $i, $j;
$regex .= "\n";
}
}
# Backtrack to regain all the identified k-cliques (Credit Mike Mikero)
my @cliques;
$regex .= '(?{ push (@cliques, join(" ", map ${$_}, 1..$k) ) })(*FAIL)' . "\n";
#print $regex, "\n";
{
no strict 'refs';
use re 'eval';
$string =~ /$regex/x;
}
@cliques;
}
__DATA__
6 267
9 10
11 12
79 80
96 570
314 583
314 584
425 426
427 428
427 429
427 430
427 472
427 473
427 474
428 430
428 473
429 430
430 472
430 473
430 474
472 474
517 519
517 520
517 521
519 520
519 521
520 521
583 584
649 650
output
6 267
9 10
11 12
79 80
96 570
425 426
649 650
314 583 584
427 429 430
427 430 472 474
427 428 430 473
517 519 520 521
Original post
This is reasonably straightforward once you have tossed back the red herring that every member of each group must be in a pair with every other member. I believe your data is simply structured in a way such that each group is represented by every possible pair within it, and the problem is simply one of gathering together all values that are paired to any other member of each group
This code is perhaps a little dense, but all the work is done within the for
loop. Two data structures are maintained in parallel. @groups
is an array of hashes whose keys are the members of the group. This is just to keep the members unique even if they are added multiple times. And %group_for
is a hash relating each member to the element of @groups
into which it has been placed
The for
loop processes each pair by looking for a group into which either of the pair has already been placed. If neither have appeared before then a new group (anonymous hash) is pushed onto the array. Finally the %groups_for
hash is updated to show where both members have been placed
The output section converts the groups from hashes to arrays, sorts each group, and sorts all groups in order of their first member
use strict;
use warnings;
my @data;
push @data, [ split ] while <DATA>;
my @groups;
my %group_for;
for my $pair ( @data ) {
my $group = $group_for{$pair->[0]} || $group_for{$pair->[1]};
push @groups, $group = {} if not $group;
$group->{$_} = 1 for @$pair;
$group_for{$_} = $group for @$pair;
}
# Change array of hashes into array of sorted values, sort array
# by first value in each group, and display
#
$_ = [ sort { $a <=> $b } keys %$_ ] for @groups;
@groups = sort { $a->[0] <=> $b->[0] } @groups;
print join(' ', map { sprintf '%3d', $_ } @$_), "\n" for @groups;
__DATA__
6 267
9 10
11 12
79 80
96 570
314 583
314 584
425 426
427 428
427 429
427 430
427 472
427 473
427 474
428 430
428 473
429 430
430 472
430 473
430 474
472 474
517 519
517 520
517 521
519 520
519 521
520 521
583 584
649 650
output
6 267
9 10
11 12
79 80
96 570
314 583 584
425 426
427 428 429 430 472 473 474
517 519 520 521
649 650