Counting presence of words within context (near other words)

Question

First of all, thanks for any help you can give me on this question. I have a list of words; in the example below it is a list of colors. Let's call this WORD_LIST_1. I want to count the number of times each word appears in a body of text. I can do this with a simple regular expression. However, I have another list of words that capture context. In the example below the context is a list of pets. Let's call this WORD_LIST_2. I'd like to count the number of times each of the words in WORD_LIST_1 appears within X words of any of the words in WORD_LIST_2. My strategy is to extract the matches to WORD_LIST_1 words into an array using a regular expression and then create a hash that counts the number of times each word is in this array. I can do this easily when the context word (WORD_LIST_2) follows the WORD_LIST_1 word. However, I run into a problem when the WORD_LIST_2 word appears before the WORD_LIST_1 words, specifically when there are multiple WORD_LIST_1 words after it.

Below is the code.

#!/usr/bin/perl -w
#use strict;

@colors = ("red", "blue", "green", "brown");

$WORD_LIST_1 = join("|",@colors);

@pets = ("cat","dog","bird","fish");
$WORD_LIST_2 = join("|",@pets);

#$text1 = "The red haired dog quickly and sharply ran away from the blue nosed cat.";
#$text1 = "The green spotted cat drinks blue water.";
#$text1 = "The brown feathered, green beaked bird flew away.";
$text1 = "The fish with blue fins and red tails.";

@finds = ();
$within_N_words = 4;
@finds = $text1 =~ m/\b(?=($WORD_LIST_1)\W+(?:\w+\W+){0,$within_N_words}?(?:$WORD_LIST_2))\b|\b(?=(?:$WORD_LIST_2)\W+(?:\w+\W+){0,$within_N_words}?($WORD_LIST_1))\b/gi;

@finds = grep defined, @finds;

print "\n\n", join("|", @finds), "\n\n";

Note that the fourth $text1 line has blue and red following fish. But it only returns "blue" and doesn't return "red" too. I've checked the first three sentences that are commented out and they appear to be working well.

My approach is based on this page: http://www.regular-expressions.info/near.html

Thoughts I've considered include using a positive look-behind, but I need to have variable lengths in the look-behind.

I've thought about reversing the entire text string and regular expression, then searching again. But this could result in double counting.

I've also thought about searching for each WORD_LIST_1 word in individual regular expressions using some sort of loop. However, this takes a lot of time on my real data, as the actual WORD_LIST_1 list is 500 or so words and I have multiple bodies of lengthy text I want to search.

Two other side notes:

(1) The regular expression above occasionally returns empty elements into the @finds array. I can't figure out why. My workaround is to use the grep defined line. What is the correct way to address this? Or rather, why is my regular expression returning blank elements?

(2) I'm still learning the "proper" way to use PERL. I've commented out use strict in this example as I don't believe in the context I am using perl it makes a difference. I'm sure someone can tell me why this is wrong of me. Good PERL programmers always seem to tell me I shouldn't run perl code without using strict, but no one yet has convinced me it is something I need to worry about. However, I'm open to learning.

I'm not really sure regular expressions are really the tool for this job. At best you end up with some extremely convoluted syntax which is - as you've found - really hard to disentangle. — Sobrique, Jul 01 '15 at 08:57
Also: Why do you distrust the opinion of good programmers? The prevailing opinion is - `strict` and `warnings` should be your very first troubleshooting steps, because they highlight some errors that will "work" but not produce desired results. — Sobrique, Jul 01 '15 at 08:59
http://stackoverflow.com/questions/8023959/why-use-strict-and-warnings — Sobrique, Jul 01 '15 at 09:00

score 1 · Accepted Answer · edited May 23 '17 at 11:51

Well, first off - the text you give... looks like red is more than 4 words away from fish in the first place?

But failing that - the problem is, I think, because your regex is "consuming" the text on the first match so it doesn't hit the second match.

In this, you start to hit the limitations of the regular expression engine - http://www.regular-expressions.info/keep.html

How important is it that you use a single regular expression to do your search? Bear in mind that whilst a regular expression looks quite concise, it can be hard to read and computationally expensive.

I would suggest therefore that your initial suggestion of splitting up your patterns isn't as bad as it sounds - in order to match 'red' and 'blue' on the second example, you need to allow for the conditions that'll permit duplicate matches.

E.g.

 fish cat red red blue blue

How many hits should you get here? You can use something like a hash to count duplicates of words and deduplicate 'relationships' though:

my %matches = (
        $text1 =~ m/
                       \b
                       ($WORD_LIST_2)
                       \W+
                       (?:\w+\W+){0,$within_N_words}?
                       ($WORD_LIST_1)\b
                   /gix
);

print Dumper \%matches;

We match into a hash, because then when we 'insert' paired words, we get key-value pairs:

$VAR1 = {
          'fish' => 'blue'
        };

But it may be of use to know - you can use qr in perl to "compile" a regex and see what you actually end up with.

In your example:

print qr /\b(?=($WORD_LIST_1)\W+(?:\w+\W+){0,$within_N_words}?(?:$WORD_LIST_2))\b|\b(?=(?:$WORD_LIST_2)\W+(?:\w+\W+){0,$within_N_words}?($WORD_LIST_1))\b/;

(?^:\b(?=(red|blue|green|brown)\W+(?:\w+\W+){0,4}?(?:(?^:cat|dog|bird|fish)))\b|\b(?=(?:(?^:cat|dog|bird|fish))\W+(?:\w+\W+){0,4}?(red|blue|green|brown))\b)

The first pattern doesn't match at all. The second does, but only once because it 'eats' the existing patterns.

my @finds2 = ( $text1 =~ m/\b(?:$WORD_LIST_2)\W+(?:\w+\W+){0,$within_N_words}?($WORD_LIST_1)\b/gi );

Finds blue. Drop the 'nongreedy' modifier, and it'll find red. But because your pattern has 'eaten' the preceding bits, it can't match twice with the g modifier.
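To make the lazy versus greedy difference concrete, here is a small sketch run against the fourth sentence. The variable names $pets and $color are mine for illustration, and the window is hard-coded to 4.

```perl
use strict;
use warnings;

my $text  = "The fish with blue fins and red tails.";
my $pets  = qr/cat|dog|bird|fish/;
my $color = qr/red|blue|green|brown/;

# Lazy quantifier: stops at the first colour after the pet.
my @lazy = $text =~ /\b(?:$pets)\W+(?:\w+\W+){0,4}?($color)\b/gi;

# Greedy quantifier: skips as many words as it can (up to 4) first.
my @greedy = $text =~ /\b(?:$pets)\W+(?:\w+\W+){0,4}($color)\b/gi;

print "lazy:   @lazy\n";      # blue
print "greedy: @greedy\n";    # red
```

Either way you get one colour per pet, because once "fish" has been matched, the g modifier resumes searching after the end of that match.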

I don't think perl will support multi-matching in that context, because if you think about it, the number of comparisons needed quickly gets huge.
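If a single regex isn't a hard requirement, one alternative (my suggestion, not something from your code) is to split the text into words and look at a window around each colour. The sketch below assumes "within N words" means at most N words in between, which is what {0,N} in your pattern allows, and it counts each colour occurrence once no matter how many pets are nearby. The sub name count_colors_near_pets is made up for the example.

```perl
use strict;
use warnings;

# Hash lookups instead of alternations; this stays cheap with 500 words.
my %is_color = map { $_ => 1 } qw(red blue green brown);
my %is_pet   = map { $_ => 1 } qw(cat dog bird fish);
my $within_N_words = 4;    # maximum number of words allowed in between

sub count_colors_near_pets {
    my ($text) = @_;
    my @words = map { lc($_) } ( $text =~ /\w+/g );
    my %count;
    for my $i ( 0 .. $#words ) {
        next unless $is_color{ $words[$i] };

        # Window covering N intervening words on either side.
        my $lo = $i - $within_N_words - 1;
        $lo = 0 if $lo < 0;
        my $hi = $i + $within_N_words + 1;
        $hi = $#words if $hi > $#words;

        # Count this colour occurrence once if any pet is in the window.
        $count{ $words[$i] }++
            if grep { $is_pet{$_} } @words[ $lo .. $hi ];
    }
    return %count;
}

my %count = count_colors_near_pets("The fish with blue fins and red tails.");
print "$_: $count{$_}\n" for sort keys %count;
```

For the fish sentence this counts both blue and red once each, and it works the same whether the pet comes before or after the colour, so there is no need to reverse the string or worry about double counting.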

I would also offer:

Check out the x modifier for writing your regexes when they get long.
You can compile regexes with qr, and it's advantageous when using variables that are effectively static (like you are).

So something like this:

my @pets = qw (cat dog bird fish );
my $WORD_LIST_2 = join( "|", map {quotemeta} @pets );
$WORD_LIST_2 = qr/$WORD_LIST_2/;

my @finds2 = (
    $text1 =~ m/
                   \b
                   (?:$WORD_LIST_2)
                   \W+
                   (?:\w+\W+){0,$within_N_words}?
                   ($WORD_LIST_1)\b
               /gix
);

For 1: Because you are capturing on both 'sides' of the alternation, but only one side can match at a time. So the capture group on the side that doesn't match returns undef. Split your pattern into two, and you won't have this problem. Or use ?| for branch reset. http://www.effectiveperlprogramming.com/2010/09/use-branch-reset-grouping-to-number-captures-in-alternations/
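A stripped-down illustration of where the undef elements come from, using a toy two-word pattern rather than your full regex. Branch reset needs perl 5.10 or later.

```perl
use strict;
use warnings;

my $text = "red dog";

# Plain alternation: the groups are numbered 1 and 2, so every match
# returns two slots and the group that did not take part is undef.
my @plain = $text =~ /\b(red)\b|\b(dog)\b/g;

# Branch reset (?|...): both alternatives share group number 1,
# so every match returns exactly one defined value.
my @reset = $text =~ /(?|\b(red)\b|\b(dog)\b)/g;

print scalar(@plain), " elements without branch reset\n";    # 4
print scalar(@reset), " elements with branch reset\n";       # 2
```

The plain version returns (red, undef, undef, dog), which is why you needed grep defined.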

For 2: See "Why use strict and warnings?": http://stackoverflow.com/questions/8023959/why-use-strict-and-warnings
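As a concrete case: your question text has WORK_LIST_2 where you meant WORD_LIST_2. If a typo like that ends up in code, strict stops it at compile time instead of quietly using an empty variable. A minimal sketch, using a string eval only so the failure can be shown without killing the script:

```perl
use strict;
use warnings;

# Compile a snippet containing a misspelled variable name.
my $ok = eval q{
    use strict;
    my $WORD_LIST_2 = "cat|dog";
    my $copy = $WORK_LIST_2;    # typo: WORK instead of WORD
    1;
};

# Under strict the undeclared variable is a compile error, so $ok is
# undef and $@ holds the message naming the offending variable.
print "strict refused to compile the snippet:\n$@" unless $ok;
```

Without strict the same snippet would run, and $copy would just be undef, which in a regex means an empty pattern that matches everywhere.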

So I'd suggest ending up with something like:

#!/usr/bin/perl 
use strict;
use warnings;
use Data::Dumper;

my @colors = qw ( red blue green brown );    
my $WORD_LIST_1 = join( "|", map {quotemeta} @colors );
   $WORD_LIST_1 = qr/$WORD_LIST_1/;

my @pets = qw (cat dog bird fish );
my $WORD_LIST_2 = join( "|", map {quotemeta} @pets );
   $WORD_LIST_2 = qr/$WORD_LIST_2/;

my $within_N_words = 4;

while ( my $text1 = <DATA> ) {

    print $text1;

    my %matches = (
        $text1 =~ m/(?|                                       
                        \b                                #word break
                          ($WORD_LIST_2) 
                          \W+
                          (?:\w+\W+){0,$within_N_words}?   #nongreedy 0-N 'words'. 
                          ($WORD_LIST_1) 
                        \b
                      |
                        \b
                            ($WORD_LIST_1) 
                            \W+
                            (?:\w+\W+){0,$within_N_words}?
                            ($WORD_LIST_2)
                        \b
                      )
                    /gix
    );

    print Dumper \%matches;
}

__DATA__
The red haired dog quickly and sharply ran away from the blue nosed cat.
The green spotted cat drinks blue water.
The brown feathered, green beaked bird flew away.
The fish with blue fins and red tails.

This gives us both words and context:

The red haired dog quickly and sharply ran away from the blue nosed cat.
$VAR1 = {
          'blue' => 'cat',
          'red' => 'dog'
        };
The green spotted cat drinks blue water.
$VAR1 = {
          'green' => 'cat'
        };
The brown feathered, green beaked bird flew away.
$VAR1 = {
          'brown' => 'bird'
        };
The fish with blue fins and red tails.
$VAR1 = {
          'fish' => 'blue'
        };

(You can use values to extract just the words).
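One caveat if what you want is counts: assigning the match list to a hash keeps only the last pair for each key, so repeated words collapse into one entry. A sketch of counting instead, walking the flat list of captures two at a time. The sentence here is made up for the example, and only the colour-before-pet direction is shown to keep it short.

```perl
use strict;
use warnings;

my $colors = qr/red|blue|green|brown/;
my $pets   = qr/cat|dog|bird|fish/;
my $text   = "The red haired dog chased the red nosed cat.";

# Flat list of captures: (colour, pet, colour, pet, ...)
my @pairs = $text =~ /\b($colors)\W+(?:\w+\W+){0,4}?($pets)\b/gi;

my %count;
while ( my ( $color, $pet ) = splice @pairs, 0, 2 ) {
    $count{ lc $color }++;    # count every hit, not just distinct keys
}

print "$_: $count{$_}\n" for sort keys %count;    # red: 2
```

Assigning the same list straight to a hash would have left a single 'red' => 'cat' entry.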
