How do I use keywords from an array in a regex to search a file?

I'm trying to look at a text file and see if and where the keywords appear. There are two files:

keyword.txt
word1
word2
word3

filestosearchon.txt
a lot of words that go on and on and contain line breaks (up to 100000 characters)

I would like to find each keyword and the position of its match. This works for one word, but I can't figure out how to iterate over the keywords with the regex.

#!/usr/bin/perl

# open profanity list
open(FILE, "keywords.txt") or die("Unable to open file");
@keywords = <FILE>; 
close(FILE);

# open text file
local $/=undef; 
open(txt, "filetosearchon.txt") or die("Unable to open file");
$txt = <txt>;

$regex = "keyword";


push @section,[length($`),length($&),$1]    
while ($txt =~ m/$regex/g);

foreach $element(@section)  
{
print (join(", ",@$element), $regex, "\n");    
}

How can I iterate the keywords from the array over this while loop to get the matched keywords and position?

Appreciate any help. Thanks

kleq
  • If you only need to match whole words from keyword.txt against whole words in filestosearch.txt you might not need regular expressions. I'd just create a hash with keywords as the keys and 1 as the value. Then attempt to look up each word in filestosearchon.txt in the hash. If the lookup succeeds, there is a match. – Brian Swift Apr 22 '12 at 19:04
  • @BrianSwift: perhaps not the most efficient solution because it requires one pass over the string per keyword. A finite automaton approach (i.e. a regular expression) only needs one pass. – Li-aung Yip Apr 22 '12 at 19:34
  • @Li-aung Yip: my approach only requires one pass through the input string/file, parsing it into words and attempting to look up each of those words in the hash which uses keywords as keys. A benefit of your approach is that the keywords could be regular expressions, not just fixed strings. However, using a regexp may require syntax to match only whole words so that `sex` doesn't match `misexplain`. – Brian Swift Apr 22 '12 at 20:32
  • @BrianSwift: Whoops, slightly misread your proposed approach. I agree that it only takes one pass to add all the words to a hash, but OP also wants to know where matches occur (if they do.) – Li-aung Yip Apr 23 '12 at 02:23

3 Answers

One way to do this would be to just build a regex containing every word:

(alpha|bravo|charlie|delta|echo|foxtrot|...|zulu)

Perl's regex compiler is pretty smart and will smoosh this down as much as it can, so the regex will be more efficient than you think. See this answer by Tom Christiansen. For example, the following regex:

(cat|rat|sat|mat)

will be optimized into something that behaves roughly like:

(c|r|s|m)at

Which is efficient to run. This approach probably beats the "search for each keyword in turn" approach because it only needs to make one pass over the input string; the naive approach requires one pass per keyword you want to search for.
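
A minimal sketch of this approach, assuming the keywords have already been read into an array. The sample `@keywords` and `$txt` values below are made up for illustration, and the `\b` anchors assume that only whole-word matches are wanted. `quotemeta` escapes any regex metacharacters in the keywords:

```perl
use strict;
use warnings;

# Hypothetical sample data; in practice these would come from
# keywords.txt and filetosearchon.txt.
my @keywords = ('word1', 'word2', 'word3');
my $txt      = "some word2 text\nand word1 again, then word3";

# Sort longest first so a longer keyword wins over one that is its prefix,
# escape metacharacters, and join everything into one alternation.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @keywords;
my $regex = qr/\b($alternation)\b/;

# One pass over the text; $-[1] is the start offset of capture group 1.
my @section;
while ($txt =~ /$regex/g) {
    push @section, [ $-[1], length($1), $1 ];
}

# Print offset, length and keyword for each match.
print join(', ', @$_), "\n" for @section;
```

Using `$-[1]` instead of `` length($`) `` gives the same offset without touching the match variables `` $` `` and `$&`, which can slow down every regex in the program on older Perls.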

By the way, if you're building a profanity filter, as your sample code suggests, remember to account for intentional misspellings: 'pron', 'p0rn', etc. Then there's the fun you can have with Unicode!

Li-aung Yip

Try grep:

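# strip trailing newlines (chomp does nothing while $/ is undef)
s/\s+\z// for @keywords;
my @words = split(/\s+/, $txt);

for my $i (0 .. $#words) {
    # exact comparison, so the word is not interpreted as a regex
    print "word \#$i\n" if grep { $_ eq $words[$i] } @keywords;
}

This gives you the word position in your text string where a keyword was found. This may or may not be more helpful than a character-based position.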
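
If character offsets are preferred, a hash lookup can still report them. The following is a rough sketch rather than part of the code above: the sample `%keyword` and `$txt` values are invented, and tokenizing on `\w+` is an assumption about what counts as a word.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two input files.
my %keyword = map { $_ => 1 } ('word1', 'word3');
my $txt     = "word3 and some\nother word1 here";

# Walk the text one word at a time; $-[1] holds the start offset
# of capture group 1 for the most recent successful match.
my @hits;
while ($txt =~ /(\w+)/g) {
    push @hits, [ $-[1], $1 ] if $keyword{$1};
}

# Report each keyword together with its character offset.
printf "offset %d: %s\n", @$_ for @hits;
```

This still makes a single pass over the text, and each word costs one hash lookup regardless of how many keywords there are.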

mpe

I am not sure what output you expect, but something like this could be useful. I save the keywords in a hash, read the second file, split each line into words and look each one up in the hash.

Content of script.pl:

use warnings;
use strict;

die qq[Usage: perl $0 <keyword-file> <search-file>\n] unless @ARGV == 2;

open my $fh, q[<], shift or die $!;

my %keyword = map { chomp; $_ => 1 } <$fh>;

while ( <> ) {
        chomp;
        my @words = split;
        for ( my $i = 0; $i <= $#words; $i++ ) {
                if ( $keyword{ $words[ $i ] } ) {
                        printf qq[Line: %4d\tWord position: %4d\tKeyword: %s\n], 
                                $., $i, $words[ $i ];
                }
        }
}

Run it like:

perl script.pl keyword.txt filetosearchon.txt

And output should be similar to this:

Line:    7      Word position:    7     Keyword: will
Line:    8      Word position:    8     Keyword: the
Line:    8      Word position:   10     Keyword: will
Line:   10      Word position:    4     Keyword: the
Line:   14      Word position:    1     Keyword: compile
Line:   18      Word position:    9     Keyword: the
Line:   20      Word position:    2     Keyword: the
Line:   20      Word position:    5     Keyword: the
Line:   22      Word position:    1     Keyword: the
Line:   22      Word position:   25     Keyword: the

Birei