This code grabs the keyword 'fun' from my text files and then prints the 20 characters before and after the keyword. However, I also want it to print the previous two lines and the next two lines, and I'm not sure how to do that. I'm also not sure whether it is easier to adapt this line-by-line code or to read the whole file in at once.

{my $inputfile = "file";
$searchword = 'fun';
open (INPUT, '<', $inputfile)  or die "fatal error reading the file \n";
while ($line1=<INPUT>)
{  
#read in a line of the file
 if ($line1 =~m/$searchword/i)
 {print "searchword found\n";
  $keepline = $line1;
    $goodline =1;

    $keepline =~/(.{1,20})(fun)(.{1,20})/gi;

    if ($goodline==1)
    {&write_excel};
 $goodline =0;                
 }
  • It reads as though it takes the 20 characters each side of the word 'pledge' whatever `$searchword` is set to? Can you clarify? – Marty Mar 01 '16 at 01:29
  • Also, other than "searchword found" it doesn't print anything that we can see - i.e. presumably `&write_excel` does something but you haven't posted its contents. – Marty Mar 01 '16 at 01:44
  • I'm sorry pledge should have read "fun". Also, this is a sub routine and the write to excel portion is later in my code. I can post it all if that is helpful. – A Roeschley Mar 02 '16 at 13:23

2 Answers

Your code as it stands seems to:

  1. Take 20 characters each side of a hard-coded word (originally 'pledge', since corrected to 'fun') rather than $searchword;
  2. Have an unmatched '{' at the start;
  3. Not print any file contents, except possibly in &write_excel, which we can't examine; and
  4. Have a logic problem: when $searchword is found, $goodline is unconditionally set to 1, then tested to see whether it is 1, and finally reset to 0, so the test is redundant.

Putting that aside, whether to read in the whole file depends somewhat on your circumstances: how big are the files you're going to search, does your machine have plenty of memory, is the machine a shared resource, and so on. I'm going to presume you can read in the whole file, as that's the more common position in my experience (those who disagree, please keep in mind that (a) I've acknowledged that it's debatable; and (b) it's very dependent on circumstances that only the OP knows).

Given that, there are several ways to read in a whole file, but the consensus seems to be to go with the module File::Slurp. Given those parameters, the answer looks like this:

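#!/usr/bin/env perl
use v5.12;       # enables say, strict and the \N escape
use warnings;
use File::Slurp;

my $searchword = 'fun';
my $inputfile  = "file.txt";
my $contents   = read_file($inputfile);   # slurp the whole file into one string

# One complete line: any non-newline characters, then a newline.
# qr// keeps the pattern grouped, so "$line?" makes the whole line optional.
my $line = qr/\N*\n/;
if ( $contents =~ /(
       $line?
       $line?
       \N* \Q$searchword\E \N* \n?
       $line?
       $line?
   )/xi) {
  say "Found:\n" . $1 ;
}
else {
  say "Not found."
}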

File::Slurp prints a reasonable error message if the file isn't present (or something else goes wrong), so I've left out the typical or die.... Whenever you work with regexes, particularly if you're trying to match text across multiple lines, it pays to use "extended mode" (by putting an 'x' after the final '/'), which allows insignificant whitespace in the regex and therefore a clearer layout.
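
As an illustration of extended mode, here is a minimal sketch of the question's original "20 characters each side" capture laid out with /x. The helper name keyword_window and the sample sentence are made up for this example. It uses {0,N} rather than the question's {1,20}, on the assumption that a keyword at the very start or end of a line should still match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (before, match, after) for the first hit,
# or an empty list when the word is absent.
sub keyword_window {
    my ($text, $word, $width) = @_;
    return $text =~ /
        ( .{0,$width} )     # up to N characters before the keyword
        ( \Q$word\E )       # the keyword itself, metacharacters escaped
        ( .{0,$width} )     # up to N characters after the keyword
    /xi;
}

my ($before, $hit, $after) =
    keyword_window("We had so much fun at the fair today", "fun", 20);
print "[$before][$hit][$after]\n";
```

Because the search word is interpolated through \Q...\E, changing the word passed in changes what is captured, which avoids the hard-coded literal in the original pattern.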

I've also separated out the definition of a line for added clarity: zero or more non-newline characters, \N*, followed by a newline, \n. However, if your target is on the first, second, second-last or last line, I presume you still want the information, so the requested preceding and following pairs of lines are matched optionally, as $line?.
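
One detail is worth checking when a sub-pattern is stored in a variable and then quantified. A plain string is pasted into the regex as text, so a trailing ? binds only to the last atom (here the \n), whereas a qr// object stays grouped and the ? applies to the whole line. The sketch below, with made-up helper names, shows the difference; treat it as an illustration rather than a full test of the answer's pattern.

```perl
use v5.12;
use warnings;

# Hypothetical helpers comparing the two ways of storing a "line" pattern.
sub matches_with_string {
    my ($text) = @_;
    my $line = '\N*\n';                  # plain string: pasted in as text
    return $text =~ /\A $line? \z/x ? 1 : 0;
}

sub matches_with_qr {
    my ($text) = @_;
    my $line = qr/\N*\n/;                # compiled regex: stays grouped
    return $text =~ /\A $line? \z/x ? 1 : 0;
}

for my $text ("abc", "abc\n", "") {
    (my $shown = $text) =~ s/\n/\\n/g;   # make newlines visible in the output
    say "'$shown': string=", matches_with_string($text),
        " qr=", matches_with_qr($text);
}
```

The string form accepts "abc" with no trailing newline, because only the newline is optional; the qr// form rejects it, because the optional unit is a complete line.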

Please note that regular expressions are pedantic, and there are inevitably fine details that affect the difference between a successful match and an unwanted one, i.e. don't expect this to do exactly what you want in all circumstances. Expect that you'll have to experiment and tweak things a bit.

Marty

I'm not sure I understand your code block (what purpose does "pledge" have? what is &write_excel?), but I can answer your question itself.

First, is this grep command acceptable? It's far faster and cleaner:

grep -i -C2 --color "fun" "file"

The -C NUM flag tells grep to provide NUM lines of context surrounding each pattern match. Obviously, --color is optional, but it may help you find the matches on really long lines.
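
For comparison, here is a rough Perl sketch of the same idea as -C: hold the lines in an array and slice out a window around each match. The helper name lines_with_context and the sample data are invented for this example, and unlike grep it does not merge overlapping windows.

```perl
use strict;
use warnings;

# Hypothetical helper: return each matching line with up to $context lines
# of surrounding text, as a list of array refs (one per match).
sub lines_with_context {
    my ($lines, $word, $context) = @_;
    my @blocks;
    for my $i (0 .. $#$lines) {
        next unless $lines->[$i] =~ /\Q$word\E/i;
        # Clamp the window to the start and end of the file.
        my $first = $i - $context < 0        ? 0        : $i - $context;
        my $last  = $i + $context > $#$lines ? $#$lines : $i + $context;
        push @blocks, [ @{$lines}[$first .. $last] ];
    }
    return @blocks;
}

my @lines = map { "$_\n" } ("one", "two", "three is fun", "four", "five", "six");
print @$_, "--\n" for lines_with_context(\@lines, "fun", 2);
```

This assumes the whole file fits in memory as an array of lines, which is the same trade-off as slurping it into a single string.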

Otherwise, here's a bit of Perl:

#!/usr/bin/perl
use strict;
use warnings;

my $searchword = "fun";
my $inputfile = "file";

my $blue = "\e[1;34m";    # change output color to blue
my $green = "\e[1;32m";   # change output color to green
my $nocolor = "\e[0;0m";  # reset output to no color

my $prev1 = my $prev2 = my $result = "";

open (INPUT, '<', $inputfile) or die "fatal error reading the file: $!\n";
while(<INPUT>) {
  if (/$searchword/i) {
    $result .= $prev2 . $prev1 . $_;  # pick up last two lines
    $prev2 = $prev1 = "";             # prevent reusing last two lines
    for (1..2) {                      # for two more non-matching lines
      while (<INPUT>) {               # parse them to ensure they don't match
        $result .= $_;                # pick up this line
        last unless /$searchword/i;   # keep reading while lines still match
      }
    }
  } else {
    $prev2 = $prev1;                  # save last line as $prev2
    $prev1 = $_;                      # save current line as $prev1
  }
}
close INPUT;                          # close the filehandle, not the file name

exit 1 unless $result;                # return with failure if without matches

$result =~                            # add colors (okay to remove this line)
  s/([^\e]{0,20})($searchword)([^\e]{0,20})/$blue$1$green$2$blue$3$nocolor/gi;
print "$result";                      # print the result
print "\n" unless $result =~ /\n\Z/m; # add newline if there wasn't already one

Bug: this assumes that the two lines before and the two lines after are actually 20+ characters. If you need to fix this, it goes in the else stanza.

Adam Katz