Identifying subarrays in matrices in Perl

Question

I am relatively new to Perl, and I need to make a relatively sophisticated matricial computation and don't know what data structures to use.

Not sure if this is the appropriate forum for this, but say you have following matrix in a multi-dimensional array in Perl:

0.2    0.7    0.2 
0.6    0.8    0.7
0.6    0.1    0.8
0.1    0.2    0.9
0.6    0.3    0.0
0.6    0.9    0.2

I am trying to identify column segments in this Matrix corresponding to continuous values that are higher than a given threshold, e.g. 0.5

For example, if we threshold this matrix, we have:

0    1    0 
1    1    1
1    0    1
0    0    1
1    0    0
1    1    0

If we now focus on the first column:

we can see that there are two continuous segments:

0 1 1 0 1 1

The first track (sequence of ones) starts with index 1 and ends with index 2
The second track (sequence of ones) starts with index 4 and ends with index 5

I would like to detect all such tracks in the original matrix, but I don't know how to proceed or what Perl data structures are most appropriate for this.

Ideally I would like something easy to index, e.g. assuming that we use the variable tracks, I can store the indices for the first column (index 0) as follows:

# First column, first track
$tracks{0}{0}{'start'} = 1; 
$tracks{0}{0}{'end'}   = 2;

# First column, second track
$tracks{0}{1}{'start'} = 4; 
$tracks{0}{1}{'end'}   = 5;

# ...

What are good data structures and/or libraries I can use to approach this problem in Perl?

Thanks! @soulSurfer2010 That's a great idea, although I would prefer to do this using standard Perl. — Amelio Vazquez-Reina, Sep 12 '12 at 20:50
why to reinvent the wheel? if you find PDL too complicated, there are various other matrix related modules over CPAN. — snoofkin, Sep 12 '12 at 20:52
From what I have read, I would go with PDL, believe me, but I am not the sys admin and it's not easy to get them to install anything in the system. I just checked if I had it with `use PDL;` but the compiler complained — Amelio Vazquez-Reina, Sep 12 '12 at 20:55
obviously PDL wont solve you this issue, but trust me, you dont want to store huge matrix's this way (hashes / array refs) in memory..thats assuming you will work with big matrix's. — snoofkin, Sep 12 '12 at 20:57
You don't need your sysadmin's permission to install and use Perl modules (unless maybe you're really really close to your disk quota) -- http://stackoverflow.com/q/251705/168657 — mob, Sep 12 '12 at 22:11

score 2 · Answer 1 · edited Oct 05 '12 at 09:28

I am just giving the algorithmic answer and you can code it in whatever language you like.

Split the problem into subproblems:

Thresholding: depending how you store you input this can be as simple as an iteration over an $n$ dimensional matrix, or a tree/list traversal if your matrices are sparse. This is the easy bit.
The algorithm for finding continuous segments is called 'run-length-encoding'. It takes a sequence with possible duplicates like 1 0 0 1 1 1 1 0 1 and returns another sequence which tells you which element is next, and how many of them are there. So for example the above sequence would be 1 1 0 2 1 4 0 1 1 1. The encoding is unique so if you ever want to invert it you are OK.

The first 1 is there because the original input starts with 1, and first 0 is there because after the 1 there is a 0, and the fourth number is two because there are two consecutive zeros. There are zillions of rle-encoders if you don't want to do your own. Its main purpose is compression and it works reasonably well for that purpose if you have long runs of identical items. Depending on your needs you may have to run it horizontally, vertically and even diagonally.

You find the precise algorithm in all the classical books on data structures and algorithm. I'd suggest Cormen-Leiseron-Rivest-Stein: 'Introduction to Algorithms' first, then Knuth.

Once you get the gist, you can safely 'fuse' the thresholding with RLE to avoid iterating twice over your inputs.

No indeed. Perl was originally going to be called *Pearl*, but that was found to clash with an existing language called *PEARL*, so it was renamed to Perl. Think of the gemstone rather than what you may have heard. Anyway, who rejects a language out of hand because of what its name means? You should give it a good try amd make up your own mind. Ignore the anti-Perl bigots - there is no need to get tribal about a programming language — Borodin, Sep 12 '12 at 21:30

Borodin · Accepted Answer · 2012-09-12T22:41:42.937

This seems to do what you want. I have represented the data in the form you suggested, as the ideal form depends entirely on what you want to do with the result

It works by calculating the list of 0s and 1s from each column, adding barrier values of zero at each end (one in $prev and one in the for list) and then scanning the list for changes between 1 and 0

Every time a change is found, a track start or end is recorded. If $start is undefined then the current index is recorded as the start of a segment, otherwise the current segment ended at one less than the current index. A hash is built with start and end keys, and pushed onto the @segments array.

The final set of nested loops dumps the calculated data in the form you show in the question

use strict;
use warnings;

use constant THRESHOLD => 0.5;

my @data = (
  [ qw/ 0.2    0.7    0.2 / ],
  [ qw/ 0.6    0.8    0.7 / ],
  [ qw/ 0.6    0.1    0.8 / ],
  [ qw/ 0.1    0.2    0.9 / ],
  [ qw/ 0.6    0.3    0.0 / ],
  [ qw/ 0.6    0.9    0.2 / ],
);

my @tracks;

for my $colno (0 .. $#{$data[0]}) {

  my @segments;
  my $start;
  my $prev = 0;
  my $i = 0;

  for my $val ( (map { $_->[$colno] > THRESHOLD ? 1 : 0 } @data), 0 ) {
    next if $val == $prev;
    if (defined $start) {
      push @segments, { start => $start, end=> $i-1 };
      undef $start;
    }
    else {
      $start = $i;
    }
  }
  continue {
    $prev = $val;
    $i++;
  }

  push @tracks, \@segments;
}

# Dump the derived @tracks data
#
for my $colno (0 .. $#tracks) {
  my $col = $tracks[$colno];
  for my $track (0 .. $#$col) {
    my $data = $col->[$track];
    printf "\$tracks[%d][%d]{start} = %d\n", $colno, $track, $data->{start};
    printf "\$tracks[%d][%d]{end} = %d\n", $colno, $track, $data->{end};
  }
  print "\n";
}

output

$tracks[0][0]{start} = 1
$tracks[0][0]{end} = 2
$tracks[0][1]{start} = 4
$tracks[0][1]{end} = 5

$tracks[1][0]{start} = 0
$tracks[1][0]{end} = 1
$tracks[1][1]{start} = 5
$tracks[1][1]{end} = 5

$tracks[2][0]{start} = 1
$tracks[2][0]{end} = 3

score 1 · Answer 3 · answered Sep 12 '12 at 23:09

Lamenting the poor support for multidimensional arrays by Perl, I soon found myself throwing together a small solution of my own. The algorithm is rather similar to Borodins idea, but with a slightly different structure:

sub tracks {
  my ($data) = @_; # this sub takes a callback as argument
  my @tracks;      # holds all found ranges
  my @state;       # is true if we are inside a range/track. Also holds the starting index of the current range.
  my $rowNo = 0;   # current row number
  while (my @row = $data->()) { # fetch new data
    for my $i (0..$#row) {
      if (not $state[$i] and $row[$i]) {
        # a new track is found
        $state[$i] = $rowNo+1; # we have to pass $rowNo+1 to ensure a true value
      } elsif ($state[$i] and not $row[$i]) {
        push @{$tracks[$i]}, [$state[$i]-1, $rowNo-1]; # push a found track into the @tracks array. We have to adjust the values to revert the previous adjustment.
        $state[$i] = 0; # reset state to false
      }
    }
  } continue {$rowNo++}
  # flush remaining tracks
  for my $i (0..$#state) {
    push @{$tracks[$i]}, [$state[$i]-1, $rowNo-1] if $state[$i]
  }
  return @tracks;
}

@state doubles as a flag indicating if we are inside a track and as a record for the track starting index. In the state and tracks arrays, the index indicates the current column.

As a data source, I used an external file, but this can be easily plugged into anything, e.g. a preexisting array. The only contract is that it must return an arbitrary sequence of true and false values and the empty list when no further data is available.

my $limit = 0.5
my $data_source = sub {
  defined (my $line = <>) or return (); # return empty list when data is empty
  chomp $line;
  return map {$_ >= $limit ? $_ : 0} split /\s+/, $line; # split the line and map the data to true and false values
};

With the data you gave copy-pasted as input, I get the following printout as output (printing code omitted):

[ [1 2], [4 5] ]
[ [0 1], [5 5] ]
[ [1 3] ]

With your structure, this would be

$tracks[0][0][0] = 1;
$tracks[0][0][1] = 2;

$tracks[0][1][0] = 4;
...;

If this is modified to a hash, further data like the original value could be incorporated.

Identifying subarrays in matrices in Perl

3 Answers3