First of all, thanks for any help you can give me on this question. I have a list of words; in the example below it is a list of colors. Let's call this WORD_LIST_1. I want to count the number of times each word appears in a body of text, which I can do with a simple regular expression. However, I have another list of words that captures context; in the example below it is a list of pets. Let's call this WORD_LIST_2. I'd like to count the number of times each word in WORD_LIST_1 appears within X words of any word in WORD_LIST_2.

My strategy is to extract the WORD_LIST_1 matches into an array using a regular expression and then build a hash that counts how many times each word occurs in that array. I can do this easily when the context word (WORD_LIST_2) follows the WORD_LIST_1 word. However, I run into a problem when the WORD_LIST_2 word appears before the WORD_LIST_1 words, specifically when several WORD_LIST_1 words follow it.
Below is the code.
#!/usr/bin/perl -w
#use strict;
@colors = ("red", "blue", "green", "brown");
$WORD_LIST_1 = join("|",@colors);
@pets = ("cat","dog","bird","fish");
$WORD_LIST_2 = join("|",@pets);
#$text1 = "The red haired dog quickly and sharply ran away from the blue nosed cat.";
#$text1 = "The green spotted cat drinks blue water.";
#$text1 = "The brown feathered, green beaked bird flew away.";
$text1 = "The fish with blue fins and red tails.";
@finds = ();
$within_N_words = 4;
@finds = $text1 =~ m/\b(?=($WORD_LIST_1)\W+(?:\w+\W+){0,$within_N_words}?(?:$WORD_LIST_2))\b|\b(?=(?:$WORD_LIST_2)\W+(?:\w+\W+){0,$within_N_words}?($WORD_LIST_1))\b/gi;
@finds = grep defined, @finds;
print "\n\n", join("|", @finds), "\n\n";
Note that the fourth $text1 line has both "blue" and "red" following "fish", but the regular expression returns only "blue" and not "red". I've checked the first three sentences, which are commented out, and they appear to work correctly.
My approach is based on this page: http://www.regular-expressions.info/near.html
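The hash-counting step from the strategy described above is not part of the posted code. A minimal sketch of it, assuming @finds already holds the matched words; the sample contents of @finds and the name %count are placeholders for illustration only:

```perl
use strict;
use warnings;

# Hypothetical input: matches as they might come back from the regex.
my @finds = ("blue", "Red", "blue");

# Tally each match, lowercased so that "Red" and "red" share one key
# (the regex is case-insensitive, so mixed case can occur).
my %count;
$count{ lc $_ }++ for @finds;

# Print the tallies in alphabetical order.
print "$_: $count{$_}\n" for sort keys %count;
```

With the sample input this prints a count of 2 for blue and 1 for red.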
I've considered using a positive look-behind, but the look-behind would need to be variable-length.
I've thought about reversing the entire text string and regular expression, then searching again. But this could result in double counting.
I've also thought about searching for each WORD_LIST_1 word with its own regular expression inside a loop. However, this takes a lot of time on my real data, as the actual WORD_LIST_1 is 500 or so words and I have multiple lengthy bodies of text to search.
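To make the loop idea concrete, here is a rough sketch of what searching one WORD_LIST_1 word at a time could look like. It covers only the direction that fails above (context word first, color afterwards), and the names $pet_re, $n and %count are made up for this illustration. It counts one hit per context word that is followed by the color within $n intervening words.

```perl
use strict;
use warnings;

my @colors = ("red", "blue", "green", "brown");
my @pets   = ("cat", "dog", "bird", "fish");
my $pet_re = join "|", @pets;
my $n      = 4;    # maximum number of words allowed in between
my $text   = "The fish with blue fins and red tails.";

my %count;
for my $color (@colors) {
    # One regex per color: a zero-width match at each pet word that is
    # followed by this color with at most $n words in between.
    # The "= () =" idiom counts the matches found by //g.
    my $hits = () = $text =~ /\b(?=(?:$pet_re)\W+(?:\w+\W+){0,$n}?$color\b)/gi;
    $count{$color} = $hits if $hits;
}

print "$_: $count{$_}\n" for sort keys %count;
```

On the fourth test sentence this reports both blue and red once each, which is the result I want, but it compiles and runs one regular expression per word in WORD_LIST_1.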
Two other side notes:
(1) The regular expression above occasionally puts empty (undefined) elements into the @finds array, and I can't figure out why. My workaround is the grep defined line. What is the correct way to address this? Or rather, why is my regular expression returning blank elements?
(2) I'm still learning the "proper" way to use Perl. I've commented out use strict in this example because I don't believe it makes a difference in the context where I am using Perl. I'm sure someone can tell me why this is wrong of me. Good Perl programmers always seem to tell me I shouldn't run Perl code without use strict, but no one has yet convinced me it is something I need to worry about. However, I'm open to learning.
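For side note (1), a stripped-down illustration of the behavior may help. In list context with /g, a match returns every capture group for every match, and a group in the alternative that did not take part comes back as undef. The pattern and the string "red cat" below are simplified stand-ins, not my real regex. The last statement uses the branch-reset construct (?|...), which assumes Perl 5.10 or later and is shown only as one possible way to avoid the undefined slots:

```perl
use strict;
use warnings;

# Two alternatives, each with its own capture group. Only one alternative
# matches at a time, so the other group's slot is undef in the result.
my @raw = "red cat" =~ /(red)|(cat)/g;

# Two matches times two groups gives four elements, two of them undef.
print scalar(@raw), " elements\n";

# Same workaround as in the script above: drop the undefined slots.
my @kept = grep { defined } @raw;
print join("|", @kept), "\n";

# Branch reset (Perl 5.10+): both alternatives share group number 1,
# so each match contributes exactly one defined element.
my @reset = "red cat" =~ /(?|(red)|(cat))/g;
print join("|", @reset), "\n";
```

Here @raw has four elements while @kept and @reset each hold just "red" and "cat".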