scenario: I am a Jr. C# developer, but recently (3 days) began learning Perl for batch files. I have a requirement to parse through a text file, extract some key data, then output the key data to a new text file. As seems to always be the case, there are butt loads of fragmented examples on the net regarding how to 'read' from a file, 'write' to a file, 'store' line by line into an array, 'filter' this and that, yadda yadda, but nothing discussing the entire process of read, filter, write. Trying to splice examples from the net together is no good, because none seem to work together as coherent code. Coming from C#, Perl's syntax structure is hella confusing. I just need some advice on this process.

My objective is to parse a text file, single out all lines similar to the one below, by date, and output only the first 8 digits of the 2nd number group and 5 digits from the 3rd number group to a new text file.

11122 20100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.0034543345

I know regex is likely the best option, and have most of the expression written, but it does not work in Perl!

Question: Could someone please show a simple (dummy) example of how to read from, filter (using dummy regex) the file, then output the (dummy) results to a new file? I'm not concerned with functional details, I can learn those, I just need the syntax structure Perl uses. For example:

 open(FH, '<', 'dummy1.txt')
 open(NFH, '>', 'dummy2.txt')

 @array; or $dumb;
 while(<FH>) 
 {
    filter each line [REGEX] and shove it into [@array or $dumb scalar]
 } 
 print(join(',', @array)) to dummy2.txt
 close FH;
 close NFH;

Note: For various reasons, I cannot paste my source code in here, sorry. Any help is appreciated.

UPDATE: ANSWER:

Many thanks to all who provided insight into my issue. After reading through your replies and doing further research, I learned that there are dozens of ways to accomplish the same task in Perl (which I am not a fan of). In the end, this is how I solved the problem, and IMO it's the cleanest and most succinct solution for anyone with similar struggles. Thanks again for all the help.

  #==========================================================================
  # 1. READ FILE:   inputFile.txt
  # 2. CREATE FILE: outputFile.txt
  # 3. WRITE TO:    outputFile.txt IF line matches REGEX constraints
  # 4. CLOSE FILES: outputFile.txt & inputFile.txt
  #==========================================================================

  #1
  $readFile = 'C:/.../.../inputFile.txt';
  open(FH, '<', $readFile) or Error("Could not read file ($!)");

  #2
  $writeFile = 'C:/.../.../outputFile.txt';
  open(NFH, '>', $writeFile) or Error("Cannot write to file ($!)");

  #3
  @lines = <FH>;
  LINE: foreach $line (@lines)
  {
     if ($line =~ m/(201403\d\d).*KEYWORD.*time was (\d+\.\d+)/)
     {
        $date = $1;
        $elapsedtime = $2;
        print NFH "$date,$elapsedtime\n";
     }
  }

  #4
  close NFH;
  close FH;
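
For anyone who wants to run the same flow under use strict; and use warnings;, here is a minimal sketch. It is not the original script. The sub name extract_entries and the command-line usage are assumptions made for illustration, it uses die in place of the Error() helper (which is not shown above), and it reads one line at a time instead of loading the whole file into @lines.

```perl
use strict;
use warnings;

# Copy "date,elapsed" pairs from matching lines of $in_path into $out_path.
# Returns the number of lines written. (Hypothetical helper name.)
sub extract_entries {
    my ($in_path, $out_path) = @_;

    open(my $in_fh,  '<', $in_path)  or die "Could not read $in_path: $!";
    open(my $out_fh, '>', $out_path) or die "Cannot write to $out_path: $!";

    my $written = 0;
    while (my $line = <$in_fh>) {
        # Same pattern as above; KEYWORD and "time was" stand in for the real markers.
        if ($line =~ m/(201403\d\d).*KEYWORD.*time was (\d+\.\d+)/) {
            print {$out_fh} "$1,$2\n";
            $written++;
        }
    }

    close $out_fh or die "Could not close $out_path: $!";
    close $in_fh;
    return $written;
}

# Run only when both paths are supplied on the command line (assumed usage).
extract_entries(@ARGV) if @ARGV == 2;
```

Lexical filehandles (my $in_fh) close automatically when they go out of scope, but the explicit close on the output handle lets a failed write be reported.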
Josh Campbell

5 Answers

while(<FH>)
{
  # variable $_ contains the current line

  if(m/regex_goes_here/) #by default, the regex match operator m// attempts to match the default $_ variable  
  {  
    #do actions  
  }  
}  

Also note, m/regex/ is the same as /regex/

Refer to: the perldoc General Variables documentation (perlvar).

For capturing variables from regex match, THIS might help

EDIT

If you want a different variable than the default $_, as @Miller suggested, use while($line = <FH>) followed by if($line =~ m/regex_goes_here/)

=~ is the Binding Operator
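
As a rough illustration of binding a named variable and capturing groups, here is a small sketch. The sample line and the pattern are made up to resemble the question's example; they are assumptions, not the asker's real data or regex.

```perl
use strict;
use warnings;

# Sample line modelled on the question; the text and "keyword" are placeholders.
my $line = "11122 20100223454345 random text keyword random text 0.0034543345";

my ($date, $number);
# In list context a successful match returns the captured groups,
# so they can be assigned straight to named variables.
if (($date, $number) = $line =~ m/^\d+\s+(\d{8})\d*\s.*keyword.*\s(\d\.\d{3})/) {
    print "$date,$number\n";
}
```

The same captures are also available as $1 and $2 inside the if block.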

dhrumeel
  • I don't understand what $_ is. I've read about it, but best I can figure it's some sort of mystical fairy variable that swoops in and does something with something and is never seen from again, all from behind the curtains. Is understanding $_ the key to getting this to work? Sorry, like I said, I'm a C# guy. – Josh Campbell Apr 02 '14 at 21:49
  • Check the perldoc General Variables link in my answer. Several functions use `$_` as the default placeholder variable (when looping over files, arrays etc) – dhrumeel Apr 02 '14 at 21:51
  • Also edited answer for how to avoid using `$_`. Understanding `$_` is not key, but has the potential to improve code readability (and ease of writing too) – dhrumeel Apr 02 '14 at 21:56

perlfaq5 - How do I change, delete, or insert a line in a file, or append to the beginning of a file? covers most of the different scenarios for how to use files.

However, I will add that you should always start your scripts with use strict; and use warnings;, and because you're doing file processing, use autodie; will serve you well too.

With that in mind, a quick stub would be the following:

use strict;
use warnings;
use autodie;

open my $infh, '<', 'dummy1.txt';
open my $outfh, '>', 'dummy2.txt';

while (my $line = <$infh>) {
    chomp $line; # Remove \n

    if ($line =~ /regex_goes_here/) { # whatever processing you need goes here
        print $outfh "your new data\n"; # no comma after the filehandle
    }
}
Miller
  • Always difficult to choose a correct answer when everyone's answer is correct. In the end, you provided some excellent pearls of wisdom (no pun intended). – Josh Campbell Aug 08 '16 at 22:42

One tip. Don't explicitly open filehandles to your input and output files. Instead read from STDIN and write to STDOUT. Your program will be far more flexible and easier to use as you'll be able to treat it like a Unix filter.

$ your_filter_program < your_input.txt > your_output.txt

And doing this actually makes your program simpler to write too.

while (<>) { # <> reads from STDIN, or from any files named on the command line
  # transform your data (which is in $_) in some way
  ...
  print; # prints $_ to STDOUT
}

You might find the first few chapters of Data Munging with Perl useful.

Dave Cross
use strict;
use warnings;
use autodie;
use feature qw(say);

use constant {
    INPUT_FILE  => "NAME_OF_INPUT_FILE",
    OUTPUT_FILE => "NAME_OF_OUTPUT_FILE",
    FILTER      => qr/regex_for_line_to_filter/,
};

open my $in_fh, "<", INPUT_FILE;
open my $out_fh, ">", OUTPUT_FILE;

while ( my $line = <$in_fh> ) {
    chomp $line;
    next unless $line =~ FILTER;
    $line =~ s/regular_expression/replacement/;
    say {$out_fh} $line;
}
close $in_fh;
close $out_fh;

$in_fh is your input filehandle, and $out_fh is your output filehandle. I basically open both and loop through the input. The chomp removes the \n from the end. I always recommend doing that.

The next goes to the next iteration of the loop unless I match FILTER which is a regular expression matching lines you want to keep. This is identical to:

if ( $line !~ FILTER ) {
    next;
}

I then use the substitution command to get the parts of the line I want and munge them into the output I want. I may be better off expanding this a bit, perhaps using split to break the line into pieces and then using only the pieces I want. I could then use substr to pull the substring out of the selected pieces.
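
The split and substr idea could look roughly like the sketch below. The sub name extract_fields and the sample line are assumptions for illustration, and the field positions (second field, last field) follow the question's example rather than any confirmed file format.

```perl
use strict;
use warnings;

# Return the first 8 characters of the second field and the first 5 of the
# last field. Returns an empty list when the line has fewer than three fields.
sub extract_fields {
    my ($line) = @_;
    my @fields = split ' ', $line;    # split on runs of whitespace
    return unless @fields >= 3;
    return (substr($fields[1], 0, 8), substr($fields[-1], 0, 5));
}

# Sample line modelled on the question's example.
my $sample = "11122 20100223454345 random text keyword random text 0.0034543345";
print join(',', extract_fields($sample)), "\n";
```

Inside the while loop above, this would replace the substitution step: keep the FILTER check, then pass $line to the helper and write the joined result with say.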

The say command is like print, except it automatically adds a newline at the end. Here it writes each line to the output file.

Now, get Learning Perl and read it. If you know any programming, it shouldn't take you more than a week to go through the first half of the book. That should be more than enough to be able to write a program like this. The more complex stuff like references and object orientation might take a bit longer.

Online documentation can be found at http://perldoc.perl.org. You can look up the use statements, which are called pragmas, over there. Documentation on the individual functions is also available.

David W.

If I understood correctly, this one-liner will do the job:

perl -ane 'print substr($F[1],0,8),"\t",substr($F[-1],0,5),"\n" if /keyword/' in.txt

Assuming in.txt is:

11122 20100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.0034543345
11122 30100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.124543345
11122 40100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.65487
11122 50100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.6215

output:

20100223    0.003
40100223    0.654
Toto