-4

I have a folder of .txt files whose contents I want to store in a hash. I then want to compare each file against an array of specific words, counting the number of times each of those words occurs.

jenniem
  • What are the specific examples of file contents and comparison hash? What have you tried so far? SO is not a "please do my homework for me" site, which this question very much seems to be. – DVK Nov 16 '10 at 12:13
  • What is your question exactly? :) – Øyvind Skaar Nov 16 '10 at 12:16
  • @Øyvind Skaar - out of sheer and unrelated to Perl curiosity, how does one properly pronounce your name, if you don't mind? – DVK Nov 16 '10 at 12:17
  • @DVK With an Ø ;) Dunno if there's anything in english that sounds exactly like it.. this guy http://answers.yahoo.com/question/index?qid=20100101105722AAsWC6Y suggests the vowel in "bird" and "hurt" – Øyvind Skaar Nov 16 '10 at 12:37
  • @DVK, @Øyvind: Perhaps helpful: [IPA for Swedish and Norwegian](http://en.wikipedia.org/wiki/Wikipedia:IPA_for_Swedish_and_Norwegian) - looks like it's similar to German ö. – Cascabel Nov 16 '10 at 17:44

3 Answers

2

Note that I use \p{Alpha} because that technically defines a word. You can monkey with the regex to add numbers or make sure that there is one alpha at the beginning or whatever you're likely to need.

Note also that for files consisting of one word per line, the regex is overkill and you should omit it. Just chomp the line and store $_.

use 5.010; # for say
use strict;
use warnings;
use Data::Dumper; # for the Dump call at the end

my ( %hash );

sub load_words { 
    @hash{ @_ } = ( 0 ) x @_; return; 
}

sub count_words {
    $hash{$_}++ foreach grep { exists $hash{$_} } @_;
}


my $word_regex
    = qr{ (                # start a capture
            \p{Alpha}+     # any sequence of one or more alpha characters
            (?:            # begin grouping of
                ['-]         # allow hyphenated words and contractions
                \p{Alpha}+   # which must be followed by an alpha
            )*             # any number of times
            (?: (?<=s)')?  # case for plural possessives (ht: tchrist)
          )                # end capture
        }x;

# load @ARGV to do <> processing
@ARGV = qw( list of files I take words from );
while ( <> ) {
    load_words( m/$word_regex/g );
}
@ARGV = qw( list of files where I count words );
while ( <> ) { 
    count_words( m/$word_regex/g );
}

# take a look at the hash
say Data::Dumper->Dump( [ \%hash ], [ '*hash' ] );
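
For the one-word-per-line case mentioned in the note above, the loading step could look roughly like the sketch below. The helper name `load_word_list` and the in-memory sample list are assumptions made for illustration; they are not part of the answer's code.

```perl
use strict;
use warnings;

# Hypothetical helper: read one word per line from a filehandle and
# return a hash ref mapping each word to a starting count of zero.
sub load_word_list {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        chomp $line;                 # no regex needed, just drop the newline
        next unless length $line;    # skip blank lines
        $count{$line} = 0;
    }
    return \%count;
}

# Demo on an in-memory "file" so the sketch runs without any input files.
my $list = "apple\nbanana\n\ncherry\n";
open my $fh, '<', \$list or die "Cannot open in-memory file: $!";
my $wanted = load_word_list($fh);
close $fh;

print join( ', ', sort keys %$wanted ), "\n";
```

With a real word list, the in-memory string would be replaced by an ordinary `open` on the file's path.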
Axeman
  • See [this answer](http://stackoverflow.com/questions/4213800/is-there-something-like-a-counter-variable-in-regular-expression-replace/4214173#4214173) for another word-based approach that looks at certain border cases. – tchrist Nov 18 '10 at 13:07
  • @tchrist: Good point about the plural possessives. :D – Axeman Nov 18 '10 at 15:23
  • I’m really *really* glad to see people starting to steer away from writing `[a-z]` in their patterns. It’s just like *so* 1960s! ☹ – tchrist Nov 18 '10 at 16:09
1

Not going to write the code for you, but you could do something like:

  1. Loop over all the files (see glob()).
  2. Loop over all the words in each file (perhaps with a regular expression or split()).
  3. Check each word against a hash of wanted words. If it's there, increment a "counter" hash value like so: $hash{ $word }++. Alternatively, you could store all the words in a hash and then grab the ones you want afterwards.

There are many ways to do it.

If your files are huge, you will have to do it another way.
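
The three steps listed above could be sketched as follows. This is only one possible reading of them: the `count_wanted` helper, the sample file names, and the choice to lowercase each line and split on non-word characters are all assumptions. The demo writes its own throwaway files with File::Temp so that it can run standalone.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: count the wanted words in every .txt file under $dir.
# Returns a hash ref of word => total count across all files.
# Assumes $dir contains no whitespace, since glob() splits its pattern on spaces.
sub count_wanted {
    my ( $dir, @wanted ) = @_;
    my %count = map { lc($_) => 0 } @wanted;

    for my $path ( glob("$dir/*.txt") ) {                 # step 1: loop over the files
        open my $fh, '<', $path or die "Cannot open $path: $!";
        while ( my $line = <$fh> ) {                      # one line at a time
            for my $word ( split /\W+/, lc $line ) {      # step 2: loop over the words
                $count{$word}++ if exists $count{$word};  # step 3: count wanted words
            }
        }
        close $fh;
    }
    return \%count;
}

# Demo with two throwaway files.
my $dir    = tempdir( CLEANUP => 1 );
my %sample = (
    'a.txt' => "The cat sat. The dog ran.\n",
    'b.txt' => "A cat and another CAT.\n",
);
for my $name ( keys %sample ) {
    open my $out, '>', "$dir/$name" or die "Cannot write $dir/$name: $!";
    print {$out} $sample{$name};
    close $out;
}

my $count = count_wanted( $dir, qw(cat dog bird) );
printf "%-5s %d\n", $_, $count->{$_} for sort keys %$count;
```

Because the sketch reads each file line by line and only keeps counters for the wanted words, its memory use does not grow with file size. Whether that is enough for very large inputs depends on the data.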

Øyvind Skaar
0

So I got it done using an array of the specific words I wanted to find... HAPPY DAYS :-)

#!/usr/bin/perl
#use strict;
use warnings;
my @words;

my @triggers=(" [kK]ill"," [Aa]ssault", " [rR]ap[ie]"," [dD]rug");
my %hash;

sub count_words {
    print "\n";
}

my $word_regex
    = qr{ (                # start a capture
            \p{Alpha}+     # any sequence of one or more alpha characters
            (?:            # begin grouping of
                ['-]         # allow hyphenated words and contractions
                \p{Alpha}+   # which must be followed by an alpha
            )*             # any number of times
          )                # end capture
        }x;

my @files;
my $dirname = "/home/directory";
opendir(DIR,$dirname) or die "can't opendir $dirname: $!";
while (defined(my $file = readdir(DIR))) {
     next unless -f "$dirname/$file";   # skip ., .. and subdirectories
     push @files, "$dirname/$file";     # join directory and file name with a slash
}
closedir(DIR);
my @interestingfiles;

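foreach my $file (@files){

    open my $fh, '<', $file or die "Cannot open $file: $!";

    while (my $line = <$fh>){
        foreach my $trigger (@triggers){
           # no /g here: in scalar context it would resume from the
           # position where the previous trigger matched on this line
           if($line =~ /$trigger/){
              push @interestingfiles, "$file\n";
           }
        }
    }
   close $fh;
}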
print @interestingfiles;
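
The code above lists the files that contain a trigger, while the question also asked for the number of times each word occurs. A minimal sketch of that counting step, assuming the same trigger patterns, might look like this. The `count_triggers` helper and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Trigger patterns copied from the answer above (leading space included).
my @triggers = ( " [kK]ill", " [Aa]ssault", " [rR]ap[ie]", " [dD]rug" );

# Hypothetical helper: count how often each pattern matches in the given lines.
sub count_triggers {
    my ( $lines, @patterns ) = @_;
    my %count = map { $_ => 0 } @patterns;
    for my $line (@$lines) {
        for my $pattern (@patterns) {
            # In list context /g returns every match; assigning through
            # the empty list "= () =" turns that list into a count.
            my $hits = () = $line =~ /$pattern/g;
            $count{$pattern} += $hits;
        }
    }
    return \%count;
}

# Sample lines standing in for the contents of one file.
my @lines = (
    "He tried to kill the weeds, then kill time.\n",
    "No drug was found. Kill is capitalised here.\n",
);
my $count = count_triggers( \@lines, @triggers );
print "$_ => $count->{$_}\n" for @triggers;
```

Calling the helper once per file, with that file's lines, would give a per-file tally. Note that a trigger at the very start of a line is not counted, because each pattern begins with a space.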
tchrist
jenniem
  • Why did you comment out `use strict;`? You should *never* do this. Fix the issues that it is revealing instead. – Ether Nov 21 '10 at 17:22