First of all, when you find yourself adding an integer suffix to variable names, think "I should have used an array."
Therefore, first I am going to put the wordsets in an array of arrayrefs. That will help identify where the matched word came from.
Second, I am going to use Regex::PreSuf to make a pattern out of each word list because I always forget the right way to do that.
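For comparison, here is a core-only sketch of building such a pattern by hand when Regex::PreSuf is not installed. It is a plain alternation, not what presuf produces (presuf factors out common prefixes and suffixes), and the helper name pattern_for is made up for illustration.

```perl
#!/usr/bin/env perl
use strict; use warnings;

# Hypothetical helper: build an anchored alternation from a word list.
# quotemeta escapes any regex metacharacters in the words. Longer words
# go first so that, if the anchors are ever dropped, a word is not
# shadowed by a shorter word that is its prefix.
sub pattern_for {
    my @words = sort { length($b) <=> length($a) } @_;
    my $alt   = join '|', map { quotemeta($_) } @words;
    return qr/\A($alt)\z/;
}

my $pat = pattern_for(qw( DOG CAT HAMSTER ));
print "HAMSTER matches\n"         if     'HAMSTER'  =~ $pat;
print "HAMSTERS does not match\n" unless 'HAMSTERS' =~ $pat;
```

The \A and \z anchors make the pattern match only when the whole string is one of the words, which is what the code further down relies on.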
Third, note that using \b in regex patterns can lead to surprising results. So, instead, I am going to split up the content into individual sequences of \w characters.
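As a small illustration of how \b behaves, the sketch below uses a made-up sample string. \b only matches between a \w character and a \W character (or the start or end of the string), which is easy to forget when a pattern begins or ends with punctuation.

```perl
#!/usr/bin/env perl
use strict; use warnings;

my $text = 'HOT-DOG costs $100';

# The hyphen is a \W character, so there is a boundary before DOG and
# the word is found inside the hyphenated compound.
my $dog_found = $text =~ /\bDOG\b/ ? 1 : 0;

# There is no boundary between the space and the dollar sign (both are \W),
# so a pattern that starts with a non-word character fails to match here.
my $price_found = $text =~ /\b\$100\b/ ? 1 : 0;

# Splitting on runs of \W yields the bare \w sequences instead.
my @words = split /\W+/, $text;

print "DOG found: $dog_found\n";
print "\$100 found: $price_found\n";
print "words: @words\n";
```

Note that splitting on \W+ also treats the hyphen as a separator, so HOT-DOG becomes the two words HOT and DOG either way.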
Fourth, you say "I also have a variable that contains the content from a web page (using WWW::Mechanize)". Do you want to match words in the comments? In title
attributes? If you don't, you should parse the HTML document either to extract full plain text or to restrict the match to within a certain element or set of elements.
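A minimal sketch of restricting the match to the visible text of one element, assuming the CPAN module HTML::TreeBuilder is installed. The page content is made up; in practice the HTML would come from $mech->content.

```perl
#!/usr/bin/env perl
use strict; use warnings;
use HTML::TreeBuilder;   # CPAN module; assumed to be installed

# Made-up page standing in for the fetched content.
my $html = <<'HTML';
<html><head><title>Pets</title></head>
<body>
<!-- DOG appears only in this comment -->
<p title="CAT">I bought a HAMSTER today.</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Take only the text inside the body element. Comments are not stored
# by default, and attribute values are not part of the element text.
my $body = $tree->find_by_tag_name('body');
my $text = $body->as_text;

# Split into words, dropping any empty leading field.
my @contents = grep { length } split /\W+/, $text;
print "@contents\n";

$tree->delete;   # release the tree when done
```

With this input, the words in the comment and in the title attribute should not end up in @contents, while HAMSTER does.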
Then, grep from the list of words in the text those that are in a wordset, and map them to the wordset they matched.
#!/usr/bin/env perl
use strict; use warnings;
use Regex::PreSuf qw( presuf );
use YAML;

my @wordsets = (
    [ qw( DOG CAT HAMSTER ) ],
    [ qw( DONKEY FOX PIG HORSE ) ],
    [ qw( RHINO LION ELEPHANT ) ],
);

# One anchored pattern per wordset, so that only whole words match.
my @patterns = map {
    my $pat = presuf(@$_);
    qr/\A($pat)\z/;
} @wordsets;

my $content = q{Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim
ad minim veniam, quis ELEPHANT exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in HAMSTER
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in DONKEY qui officia deserunt mollit anim id
est laborum.};

# Break the content into individual sequences of \w characters.
my @contents = split /\W+/, $content;

# For each wordset index, emit { word => index } for every matching word.
print Dump [
    map {
        my $i = $_;
        map +{$_ => $i },
            grep { $_ =~ $patterns[$i] } @contents
    } 0 .. $#patterns
];
Here, grep { $_ =~ $patterns[$i] } @contents extracts the words from @contents which are in the given wordset. Then, map +{$_ => $i } maps those words to the wordset from which they came. The outer map just loops over each wordset pattern.
Output:
---
- HAMSTER: 0
- DONKEY: 1
- ELEPHANT: 2
That is, you get a list of hashrefs where the key in each hashref is the word that was found and the value is the index (in @wordsets) of the wordset that matched.
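Since the content is split into whole words anyway, the same lookup can also be sketched with a plain hash instead of patterns. This is an alternative technique, not the approach above; it assumes each word belongs to at most one wordset, and the name %set_of and the shortened sample text are made up for illustration.

```perl
#!/usr/bin/env perl
use strict; use warnings;

my @wordsets = (
    [ qw( DOG CAT HAMSTER ) ],
    [ qw( DONKEY FOX PIG HORSE ) ],
    [ qw( RHINO LION ELEPHANT ) ],
);

# Map each word to the index of the wordset it belongs to.
my %set_of;
for my $i (0 .. $#wordsets) {
    $set_of{$_} = $i for @{ $wordsets[$i] };
}

my $content = 'quis ELEPHANT exercitation, in HAMSTER velit, sunt in DONKEY qui';

# Keep the words that are in some wordset, paired with that wordset's index.
my @found = map  { [ $_, $set_of{$_} ] }
            grep { exists $set_of{$_} }
            split /\W+/, $content;

print "$_->[0]: $_->[1]\n" for @found;
```

Unlike the pattern version, this reports matches in the order the words appear in the text (ELEPHANT, HAMSTER, DONKEY here) rather than grouped by wordset.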