3

I have a problem I am hoping someone can help with (greatly simplified for the purposes of explaining what I am trying to do)...

I have three different arrays:

my @array1 =  ("DOG","CAT","HAMSTER");
my @array2 =  ("DONKEY","FOX","PIG", "HORSE");
my @array3 =  ("RHINO","LION","ELEPHANT");

I also have a variable that contains the content from a web page (using WWW::Mechanize):

my $variable = $r->content;

I now want to see if any of the elements in each of the arrays are found in the variable, and if so which array it comes from:

e.g.

if ($variable =~ (any of the elements in @array1)) {
     print "FOUND IN ARRAY1";
} elsif ($variable =~ (any of the elements in @array2)) { 
     print "FOUND IN ARRAY2";
} elsif ($variable =~ (any of the elements in @array3)) {
     print "FOUND IN ARRAY3";
}

What is the best way to go about doing this using the arrays and iterating through each element in the arrays? Is there a better way this can be done?

Your help is much appreciated, thanks.

yonetpkbji

6 Answers

7

You can build a regex from the array elements, but you will most likely want to escape metacharacters and make sure you do not get partial matches:

my $rx = join('\b|\b', map quotemeta, @array1);

if ($variable =~ /\b$rx\b/) {
    print "matched array 1\n";
}

If you do want to get partial matches, such as FOXY below, simply remove all the \b sequences.
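As a minimal sketch of the difference between the two modes, the helper below builds either a whole-word or a partial-match pattern from a word list. The name `build_pattern` and its `$whole_word` flag are hypothetical, introduced only for illustration, and the alternation is wrapped in a `(?:...)` group so that a single pair of `\b` anchors covers every word.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern from a word list.
# $whole_word toggles the \b anchors around the grouped alternation.
sub build_pattern {
    my ($whole_word, @words) = @_;
    my $alt = join '|', map { quotemeta($_) } @words;
    return $whole_word ? qr/\b(?:$alt)\b/ : qr/(?:$alt)/;
}

my @array2 = ("DONKEY", "FOX", "PIG", "HORSE");
my $text   = "A FOXY HORSEY";

my $whole   = build_pattern(1, @array2);
my $partial = build_pattern(0, @array2);

# FOXY and HORSEY only contain FOX and HORSE as prefixes,
# so only the pattern without \b can match here.
print "whole-word: ", ($text =~ $whole   ? "match" : "no match"), "\n";
print "partial:    ", ($text =~ $partial ? "match" : "no match"), "\n";
```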

Demonstration:

use strict;
use warnings;

my @array1 =  ("DOG","CAT","HAMSTER");
my @array2 =  ("DONKEY","FOX","PIG", "HORSE");
my @array3 =  ("RHINO","LION","ELEPHANT");

my %checks = (
    array1 => join('\b|\b', map quotemeta, @array1),
    array2 => join('\b|\b', map quotemeta, @array2),
    array3 => join('\b|\b', map quotemeta, @array3),
);

while (<DATA>) {
    chomp;
    print "The string: '$_'\n";
    for my $key (sort keys %checks) {
        print "\t";
        if (/\b$checks{$key}\b/) {
            print "does";
        } else {
            print "does not";
        }
        print " match $key\n";
    }
}

__DATA__
A DOG ATE MY RHINO
A FOXY HORSEY

Output:

The string: 'A DOG ATE MY RHINO'
        does match array1
        does not match array2
        does match array3
The string: 'A FOXY HORSEY'
        does not match array1
        does not match array2
        does not match array3
TLP
  • putting the \b in the join will disable aho-corasick matching, I believe; just do `\b(?:$rx)\b` instead – ysth Apr 12 '13 at 19:47
  • @ysth Aho what? What's that in english? – TLP Apr 12 '13 at 19:49
  • a matching algorithm that perl sometimes will use for | alternated fixed strings; without it, basically each | alternative will be tried at each position in the string until one matches. http://en.wikipedia.org/wiki/Aho-Corasick – ysth Apr 12 '13 at 20:33
2
my $re1 = join '|', @array1;
say "found in array 1" if $variable =~ /$re1/;

Repeat for each additional array (or use an array of regexes and an array of arrays of terms).
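One way the "array of regexes and an array of arrays of terms" idea might look is sketched below. The names `@wordsets`, `@regexes`, and `first_matching_set` are assumptions made for this example, and the escaping with `quotemeta` is an addition to guard against special characters. The first list that matches wins, mirroring the if/elsif chain in the question.

```perl
use strict;
use warnings;

# The three word lists from the question, kept together in one structure.
my @wordsets = (
    [ "DOG", "CAT", "HAMSTER" ],
    [ "DONKEY", "FOX", "PIG", "HORSE" ],
    [ "RHINO", "LION", "ELEPHANT" ],
);

# One precompiled regex per word list, with metacharacters escaped.
my @regexes = map {
    my $alt = join '|', map { quotemeta($_) } @$_;
    qr/$alt/;
} @wordsets;

# Hypothetical helper: return the 1-based number of the first list
# that has a word somewhere in $text, or 0 if none does.
sub first_matching_set {
    my ($text) = @_;
    for my $i (0 .. $#regexes) {
        return $i + 1 if $text =~ $regexes[$i];
    }
    return 0;
}

my $n = first_matching_set("THE LION SAW A PIG");
print $n ? "FOUND IN ARRAY$n\n" : "NOT FOUND\n";
```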

Dave Sherohman
  • what if one of the contents of @array has special characters, like a '|'? – imran Apr 11 '13 at 14:03
  • @imran: In that case, `my $re1 = join '|', map { "\Q$_\E" } @array1;` – Dave Sherohman Apr 11 '13 at 14:06
  • You also have to worry about partial matches. – TLP Apr 11 '13 at 14:06
  • @TLP: The spec is to see whether any of the array elements are found within the variable, which is stated to be the content of a web page. Partial matches are implicitly acceptable, given that none of the target patterns begin with `` or end with ``. – Dave Sherohman Apr 11 '13 at 14:11
2

First of all, when you find yourself adding an integer suffix to variable names, think "I should have used an array."

Therefore, first I am going to put the wordsets in an array of arrayrefs. That will help identify where the matched word came from.

Second, I am going to use Regex::PreSuf to make a pattern out of each word list because I always forget the right way to do that.

Third, note that using \b in regex patterns can lead to surprising results. So, instead, I am going to split up the content into individual sequences of \w characters.

Fourth, you say "I also have a variable that contains the content from a web page (using WWW::Mechanize)". Do you want to match words in the comments? In title attributes? If you don't, you should parse the HTML document either to extract full plain text or to restrict the match to within a certain element or set of elements.

Then, grep from the list of words in the text those that are in a wordset and map them to the wordset they matched.

#!/usr/bin/env perl

use strict; use warnings;

use Regex::PreSuf qw( presuf );

my @wordsets = (
    [ qw( DOG CAT HAMSTER ) ],
    [ qw( DONKEY FOX PIG HORSE ) ],
    [ qw( RHINO LION ELEPHANT ) ],
);

my @patterns = map {
    my $pat = presuf(@$_);
    qr/\A($pat)\z/;
} @wordsets;

my $content = q{Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim
ad minim veniam, quis ELEPHANT exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in HAMSTER
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in DONKEY qui officia deserunt mollit anim id
est laborum.};

my @contents = split /\W+/, $content;

use YAML;
print Dump [
    map {
        my $i = $_;
        map +{$_ => $i },
        grep { $_ =~ $patterns[$i] } @contents
    } 0 .. $#patterns
];

Here, grep { $_ =~ $patterns[$i] } @contents extracts the words from @contents which are in the given wordset. Then, map +{$_ => $i } maps those words to the wordset from which they came. The outer map just loops over each wordset pattern.

Output:

---
- HAMSTER: 0
- DONKEY: 1
- ELEPHANT: 2

That is, you get a list of hashrefs where the key in each hashref is the word that was found and the value is the index of the wordset that matched.

Sinan Ünür
0

EDIT: I think you could use perl's map function, something like this:

my @a1matches = map { $variable =~ /\Q$_\E/ ? $_ : () } @array1;
print "FOUND IN ARRAY1\n" if @a1matches;

my @a2matches = map { $variable =~ /\Q$_\E/ ? $_ : () } @array2;
print "FOUND IN ARRAY2\n" if @a2matches;

my @a3matches = map { $variable =~ /\Q$_\E/ ? $_ : () } @array3;
print "FOUND IN ARRAY3\n" if @a3matches;

A fun side effect is that @a1matches contains the elements of @array1 that were found in $variable.

Rob I
  • That will never return false unless `$variable` contains a false value. And also, your check is reversed. – TLP Apr 11 '13 at 13:55
  • You have it backwards. He wants to see whether any of the array elements are in `$variable`, not whether `$variable` is in any of the arrays. – Dave Sherohman Apr 11 '13 at 13:55
  • what's with the returns inside the map? Quite unconventional. – imran Apr 11 '13 at 14:06
  • @imran you're right, I did not mean that (in fact it will not work) - I used the ternary operator now. Thanks everyone! – Rob I Apr 11 '13 at 14:09
  • You can still do this with grep: `@a1matches = grep { $variable =~ /\Q$_\E/ } @array1;` – imran Apr 11 '13 at 14:13
0

I assume $variable is not an array, in which case use a foreach statement.

foreach my $item (@array1) {
    if ($item eq $variable) {
        print "FOUND IN ARRAY1";
    }
}

and repeat the above for each array, i.e. array2, array3...
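Note that `eq` only succeeds when `$variable` is exactly equal to one element. If `$variable` holds a whole page of text, as in the question, a substring test is probably closer to what is wanted. The sketch below uses the built-in `index` for that; the helper name `contains_any` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any word in the list occurs
# anywhere inside $text (plain substring test, no regex needed).
sub contains_any {
    my ($text, @words) = @_;
    for my $word (@words) {
        return 1 if index($text, $word) >= 0;
    }
    return 0;
}

my @array1 = ("DOG", "CAT", "HAMSTER");
print "FOUND IN ARRAY1\n" if contains_any("MY CAT SLEEPS", @array1);
```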

Kenneth P. Hough
0

Regexp::Assemble may be helpful if you would like to use a module. It lets you assemble a set of regular expression strings into a single regular expression that matches everything the individual expressions match.
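A minimal sketch of how this might be applied to the word lists from the question, assuming Regexp::Assemble is installed from CPAN. It uses only the module's `new`, `add`, and `re` methods; the wrapper `assemble_words` is a hypothetical name for this example.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# Hypothetical wrapper: assemble a list of literal words into one regex.
sub assemble_words {
    my @words = @_;
    my $ra = Regexp::Assemble->new;
    $ra->add(quotemeta($_)) for @words;   # escape metacharacters first
    return $ra->re;                       # compiled regex object
}

my $re = assemble_words(qw(DOG CAT HAMSTER));
print "FOUND IN ARRAY1\n" if "A DOG ATE MY RHINO" =~ /$re/;
```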

katastrophos