-1

This is a logic problem I'm not quite equipped to deal with it seems.

I have a data set of paired samples which are each identified by a unique number. A pair indicates a certain level of relatedness between the samples. I want to group these samples such that every member of a group is supported by a pair to every other member of the group.

For example, in the data set below pairs 6, 7, and 28 constitute a group of 3. Pairs 22 through 27 constitute a group of 4, in this case there 4 groups of 3 inside this group which I don't want in the output. More difficult are pairs 10, 11, and 17 which are another group of 3.

Data set:

    [,1] [,2]
 [1,]    6  267
 [2,]    9   10
 [3,]   11   12
 [4,]   79   80
 [5,]   96  570
 [6,]  314  583
 [7,]  314  584
 [8,]  425  426
 [9,]  427  428
[10,]  427  429
[11,]  427  430
[12,]  427  472
[13,]  427  473
[14,]  427  474
[15,]  428  430
[16,]  428  473
[17,]  429  430
[18,]  430  472
[19,]  430  473
[20,]  430  474
[21,]  472  474
[22,]  517  519
[23,]  517  520
[24,]  517  521
[25,]  519  520
[26,]  519  521
[27,]  520  521
[28,]  583  584
[29,]  649  650

A couple things about the data set: the value in column 2 of a pair will always be greater than the value in column 1 and the values in column 1 are in ascending order.

This data set is a simple version of the problem with a limited number of pairs for any one sample (the most is 427 with 6) but more complicated data sets will have any number of levels so I need a solution that is agnostic to this factor. I think recursion of some sort is the way to go and I have been playing around with such in R but am fairly new to the concept and am definitely not getting the right answer.

I can't be the first person ever to do this but I haven't found anything yet, probably because describing the problem is a bit awkward in a search engine.

Anyway, if anyone knows anything about this I would appreciate the help!

Edit -

Expected output (each line is a group):

 [1,]    6  267
 [2,]    9   10
 [3,]   11   12
 [4,]   79   80
 [5,]   96  570
 [6,]  314  583  584
 [7,]  425  426
 [8,]  427  428  430  473
 [9,]  427  429  430
[10,]  427  430  472  474
[11,]  517  519  520  521
[12,]  649  650

To clarify what I mean by each member of a group being supported by a pair to every other member: Group 1 has 2 members therefore requires one supporting pair which is pair 1. Group 6 has three members and therefore requires support from three pairs 6, 7, and 28 proving that each group member pairs with both of the others (this can be visualized as a triangle with each member a vertex and the connecting lines pairs). Group 8 has four members and therefore requires the support of six pairs 9, 11, 13, 15, 16, 19 (you can picture this as a square with each member a vertex and lines between all vertices giving six pairs). I hope this clarifies! Its hard to explain and the solution seems to be non-trivial.

2 Answers2

2

Update 3 add pivoting

This is a suggested optimisation to the module below, which should reduce the number of recursions significantly. Add it to the end of the module code, and replace the loop for my $v ( keys %$p ) in _bron_kerbosh with for my $v ( _choose_pivot($p, $x) )

# Find an element u of P U X such that as many as possible of its
# neighbours fall in P
#
sub _choose_pivot {
    my ( $p, $x ) = @_;

    my @p = keys %$p;
    my @choice = @p;

    for my $u ( @p, keys %$x ) {
        my $nu = $neighbours{$u};
        my %nu = map +( $_ => 1 ), @$nu;
        my @subset = grep { not $nu{$_} } @p;
        @choice = @subset if @subset < @choice;
    }

    @choice;
}

Update 2 with module

Wikipedia describes the Bron-Kerbosch algorithm for finding maximal cliques in a graph. It also says

Although other algorithms for solving the clique problem have running times that are, in theory, better on inputs that have few maximal independent sets, the Bron–Kerbosch algorithm and subsequent improvements to it are frequently reported as being more efficient in practice than the alternatives.

So since CPAN appears to have no clique module that I can find I thought it would be useful to implement it. This is the code. You should copy and save it as Graph/Cliques/Bron_Kerbosch.pm. I shall prepare some tests and put it on CPAN shortly

package Graph::Cliques::Bron_Kerbosch;

use strict;
use warnings;
use v5.8.3;

use Exporter qw/ import /;

our @EXPORT_OK = qw/ get_cliques /;

my ( %neighbours, @cliques );

sub get_cliques {

    my ( $edges ) = @_;
    %neighbours = ();
    @cliques = ();

    for my $edge ( @$edges ) {
        my ( $n1, $n2 ) = @$edge;
        $neighbours{$n1}{$n2} = 1;
        $neighbours{$n2}{$n1} = 1;
    }
    $_ = [ keys %$_ ] for values %neighbours;

    my ( %r, %p, %x );
    $p{$_} = 1 for map @$_, @$edges;

    _bron_kerbosch( \( %r, %p, %x ) );

    @cliques;
}

sub _bron_kerbosch {
    my ( $r, $p, $x ) = @_;

    unless ( %$p or %$x ) {
        push @cliques, [ keys %$r ];
        return;
    }

    for my $v ( keys %$p ) {

        my $nv = $neighbours{$v};

        my %r_ = ( %$r, $v => 1 );
        my %p_ = map { $_ => 1 } _intersect( [ keys %$p ], $nv);
        my %x_ = map { $_ => 1 } _intersect( [ keys %$x ], $nv);

        _bron_kerbosch( \( %r_, %p_, %x_ ) );

        delete $p->{$v};
        $x->{$v} = 1;
    }
}

sub _intersect {
    my ( $aa, $ab ) = @_;
    my %ab = map { $_ => 1 } @$ab;
    grep $ab{$_}, @$aa;
}

1;

And this is the program that drives the module using your own data. get_cliques executes in just under a millisecond on my system

use strict;
use warnings;

use Graph::Cliques::Bron_Kerbosch qw/ get_cliques /;

# Read the data into an array of arrays, converting from the question's R
# output. Each element of @edges contains a pair of nodes of the graph
#
my @edges;
while ( <DATA> ) {
    my @pair = split;
    next unless @pair > 2 and shift( @pair ) =~ /\[/;
    push @edges, \@pair;
}

# Call the utility function to get a list of cliques
#
my @groups = get_cliques( \@edges );

# Extract the hash keys to change the array of hashes into an array of sorted
# arrays, then sort the array first by the size of the clique and then by the
# first value in each group
#
$_ = [ sort { $a <=> $b } @$_ ] for @groups;
@groups = sort { @$a <=> @$b or $a->[0] <=> $b->[0] } @groups;
print join( ' ', map { sprintf '%3d', $_ } @$_ ), "\n" for @groups;

__DATA__
    [,1] [,2]
 [1,]    6  267
 [2,]    9   10
 [3,]   11   12
 [4,]   79   80
 [5,]   96  570
 [6,]  314  583
 [7,]  314  584
 [8,]  425  426
 [9,]  427  428
[10,]  427  429
[11,]  427  430
[12,]  427  472
[13,]  427  473
[14,]  427  474
[15,]  428  430
[16,]  428  473
[17,]  429  430
[18,]  430  472
[19,]  430  473
[20,]  430  474
[21,]  472  474
[22,]  517  519
[23,]  517  520
[24,]  517  521
[25,]  519  520
[26,]  519  521
[27,]  520  521
[28,]  583  584
[29,]  649  650

output

  6 267
  9  10
 11  12
 79  80
 96 570
425 426
649 650
314 583 584
427 429 430
427 428 430 473
427 430 472 474
517 519 520 521



Update 1

Okay what you have here is known mathematically as a graph, and what you are describing, where every value is connected to every other value, is called a complete graph

Knowing that lets you use Google, and there is a question "Find all complete sub-graphs within a graph" here on Stack Overflow which tells us that a complete subgraph is called a clique, which has its very own set of clique problems, of which yours is "listing all maximal cliques". Wikipedia tells us that "These problems are all hard"!

On this basis I checked CPAN for a clique module and found Graph::Clique which I assumed I would just have to plug in to your question. However it has problems

  • It looks only for cliques of a specific size

  • It's broken, and dies with the message

    Can't use string ("1") as a SCALAR ref while "strict refs" in use
  • Because of a sorting bug, it works only with numeric node names that all have the same number of digits

It also uses a brute-force technique that employs a regex method, which while quite clever is not that fast

As it was a better place to start than nothing I fixed it and added some calling code that checks whether a smaller clique found earlier is a subset of a larger one. The result is this program that seems to do what you want

Note though, that I think your expected data is wrong, as it contains cliques that are subsets of others in your list, as I commented beneath your question. And you can't want to include all subsets, as otherwise your example would list all node pairs instead of just some of them. There are actually seven two-node cliques in your data; [517, 521] isn't one of them because it is a subset of [517, 519, 520, 521]

This program runs in just under six seconds on my system. The algorithm works by looking for cliques of successively larger sizes until none are found. By far the biggest delay here is establishing that there are no cliques with five nodes in your data, which takes around five seconds. Finding all of those with four nodes or less takes less than a second

use strict;
use warnings;

use List::MoreUtils qw/ uniq any all /;

# Read the data into an array of arrays. Each element of @edges contains a
# pair of nodes of the graph
#
my @edges;
push @edges, [ split ] while <DATA>;

# Keep asking for cliques of a larger size until we find none. Remove from
# those already found any that are subsets of new ones
#
my @groups;

for ( my $size = 2; my @cliques = get_cliques( $size, \@edges ); ++$size ) {

    @cliques = map +{ map +( $_ => 1 ), split }, @cliques;

    for ( my $i = 0; $i < @groups; ) {

        my $group = $groups[$i];

        my $subset = any {
            my $clique = $_;
            all { $clique->{$_} } keys %$group;
        } @cliques;

        if ( $subset ) {
            splice @groups, $i, 1;
        }
        else {
            ++$i;
        }
    }

    push @groups, @cliques;
}

# Extract the hash keys to change the array of hashes into an array of sorted
# arrays, then sort the array first by the size of the clique and then by the
# first value in each group
#
$_ = [ sort { $a <=> $b } keys %$_ ] for @groups;
@groups = sort { @$a <=> @$b or $a->[0] <=> $b->[0] } @groups;
print join( ' ', map { sprintf '%3d', $_ } @$_ ), "\n" for @groups;


# This subroutine is based on the non-functional `Graph::Clique` CPAN module
# by Edward Wijaya, <ewijaya@singnet.com.sg>
#
sub get_cliques {

    my ( $k, $edges ) = @_;

    my $string = do {
        my @vertices = sort { $a <=> $b } uniq map @$_, @$edges;
        my @edges = map "$_->[0]-$_->[1]", sort { $a->[0] <=> $b->[0] } @{$edges};
        local $" = ',';              # Fix SO syntax colouring "
        "@vertices;@edges";
    };

    my $regex = join '[^;]+', ('\b(\d+)\b') x $k;
    $regex .= '[^;]*;';
    $regex .= "\n";

    for my $i ( 1 .. $k-1 ) {
        for my $j ( $i+1 .. $k ) {
            $regex .= sprintf '(?=.*\b\g%d-\g%d\b)', $i, $j;
            $regex .= "\n";
        }
    }

    # Backtrack to regain all the identified k-cliques (Credit Mike Mikero)
    my @cliques;
    $regex .= '(?{ push (@cliques, join(" ", map ${$_}, 1..$k) ) })(*FAIL)' . "\n";
    #print $regex, "\n";

    {
        no strict 'refs';
        use re 'eval';
        $string =~ /$regex/x;
    }

    @cliques;
}

__DATA__
6 267
9 10
11 12
79 80
96 570
314 583
314 584
425 426
427 428
427 429
427 430
427 472
427 473
427 474
428 430
428 473
429 430
430 472
430 473
430 474
472 474
517 519
517 520
517 521
519 520
519 521
520 521
583 584
649 650

output

  6 267
  9  10
 11  12
 79  80
 96 570
425 426
649 650
314 583 584
427 429 430
427 430 472 474
427 428 430 473
517 519 520 521



Original post

This is reasonably straightforward once you have tossed back the red herring that every member of each group must be in a pair with every other member. I believe your data is simply structured in a way such that each group is represented by every possible pair within it, and the problem is simply one of gathering together all values that are paired to any other member of each group

This code is perhaps a little dense, but all the work is done within the for loop. Two data structures are maintained in parallel. @groups is an array of hashes whose keys are the members of the group. This is just to keep the members unique even if they are added multiple times. And %group_for is a hash relating each member to the element of @groups into which it has been placed

The for loop processes each pair by looking for a group into which either of the pair has already been placed. If neither have appeared before then a new group (anonymous hash) is pushed onto the array. Finally the %groups_for hash is updated to show where both members have been placed

The output section converts the groups from hashes to arrays, sorts each group, and sorts all groups in order of their first member

use strict;
use warnings;

my @data;

push @data, [ split ] while <DATA>;

my @groups;
my %group_for;

for my $pair ( @data ) {

  my $group = $group_for{$pair->[0]} || $group_for{$pair->[1]};
  push @groups, $group = {} if not $group;

  $group->{$_} = 1 for @$pair;
  $group_for{$_} = $group for @$pair;
}

# Change array of hashes into array of sorted values, sort array
# by first value in each group, and display
#
$_ = [ sort { $a <=> $b } keys %$_ ] for @groups;
@groups = sort { $a->[0] <=> $b->[0] } @groups;

print join(' ', map { sprintf '%3d', $_ } @$_), "\n" for @groups;


__DATA__
  6 267
  9  10
 11  12
 79  80
 96 570
314 583
314 584
425 426
427 428
427 429
427 430
427 472
427 473
427 474
428 430
428 473
429 430
430 472
430 473
430 474
472 474
517 519
517 520
517 521
519 520
519 521
520 521
583 584
649 650

output

  6 267
  9  10
 11  12
 79  80
 96 570
314 583 584
425 426
427 428 429 430 472 473 474
517 519 520 521
649 650
Community
  • 1
  • 1
Borodin
  • 126,100
  • 9
  • 70
  • 144
  • Hey, thanks so much man, this is amazing, way beyond what I was even hoping for, really appreciate it! As a logic problem I knew it had to have been solved before I just didn't know what it was called. How did you figure that out? Thanks again! – Max Hargreaves Sep 26 '15 at 13:26
  • @MaxHargreaves: That okay -- it interested me and it looks like the community is missing a CPAN module. Is that your real data, or do you have something more comprehensive that will test my module – Borodin Sep 26 '15 at 13:28
  • @MaxHargreaves: There was a bug I just fixed, whereby `get_cliques` would fail if you ran it multiple times in the same program – Borodin Sep 26 '15 at 13:44
  • I have two further data sets one of 75 pairs and one of 489 pairs. Is there a best way to send them to you, or should I just post them? – Max Hargreaves Sep 26 '15 at 15:52
  • @MaxHargreaves: Just try them please. Let me know whether they work okay for you, and how long the bigger data set takes to run. If anything needs fixing then just put the data up on pastebin.com. Thank you very much – Borodin Sep 26 '15 at 19:06
  • I ran the script on the set of 489 pairs using the unix time command I got 0.774s of real time. I made some larger data sets to see how it scaled 1151 pairs took 2.178s, 1807 pairs took 10.409s. This is where it began to scale hard 2670 took 58.996s, 3854 took > 6min at least (the output froze and I didn't get to see the actual elapsed time). This was running on a single core of an AMD Opteron 6174 which is a 12core server processor at 2200mhz. – Max Hargreaves Sep 27 '15 at 19:46
  • @MaxHargreaves: Thank you very much for this. I have added an optimisation in an update to my solution. Since you seem to be able to create test data conveniently, I wonder if you would be so good as test that optimisation in the region of 1,200 nodes, and see how it compares with the original method? For convenience, the original corresponds to a `_choose_pivot` method of `sub _choose_pivot { keys %{$_[0]} }` – Borodin Sep 28 '15 at 10:37
  • With the modification 1151 pairs took 0.440s, 1807 pairs took 1.717s, 2670 pairs took 2.211s, and 3854 took 2.395s. This optimization definitely earns the name. For fun I decided to see if I could kick it in the teeth but 13783 pairs took just 18.102 seconds. Of course I don't know if the output is full and correct at these large sizes but it definitely works for the small data set. Thanks for your help again! – Max Hargreaves Sep 28 '15 at 16:39
  • @MaxHargreaves: That's really helpful, thank you very much – Borodin Sep 28 '15 at 17:51
1

Here's how I'd contruct the initial groups. I'm not sure I understand correctly the condition about links "to every other member", so I'll update the code after you show us the expected output for the given sample.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my $group_counter = 1;
my %in_group;
my %members;
while (<>) {
    my ($key, $v1, $v2) = split;
    my @groups;
    for my $value ($v1, $v2) {
        if (my $g = $in_group{$value}) {
            # Existing groups to merge, no duplicates.
            push @groups, $g unless @groups && $g == $groups[0]; 
        }
    }

    { 0 => sub { # New group.
          $in_group{$_} = $group_counter for $v1, $v2;
          push @{ $members{$group_counter} }, [ $key, $v1, $v2 ];
          $group_counter++;
      },
      1 => sub { # Add to 1 group.
          $in_group{$_} = $groups[0] for $v1, $v2;
          push @{ $members{ $groups[0] } }, [ $key, $v1, $v2 ];
      },
      2 => sub { # Merge 2 groups, add to the result.
          $in_group{$v2} = $groups[0];
          @in_group{ @$_[1, 2] } = ($groups[0]) x 2 for @{ $members{ $groups[1] } };
          push @{ $members { $groups[0] } },
               @{ delete $members{ $groups[1] } };
      },
    }->{@groups}->();
}

for my $g (keys %members) {
    say join ' ', map $_->[0], @{ $members{$g} };
}

Output (each line represents a group):

[3,]
[2,]
[1,]
[8,]
[29,]
[5,]
[6,] [7,] [28,]
[4,]
[22,] [23,] [24,] [25,] [26,] [27,]
[9,] [10,] [11,] [12,] [13,] [14,] [15,] [16,] [17,] [18,] [19,] [20,] [21,]
choroba
  • 231,213
  • 25
  • 204
  • 289
  • I added the expected output and tried to explain what I meant by "each member of a group being supported by a pair to every other member". Also I appreciate reply, though it solves a different problem! – Max Hargreaves Sep 26 '15 at 00:38