Need help executing perl tokening script

Question

I'm a Perl amateur. Recently I was given a Perl script that takes a text file and removes all formatting, leaving only the individual words, each followed by a space. The problem is that it is unclear how to give the script an input file. I've written some code to run it over an entire directory of files, but I haven't been able to get it to execute yet. I'll post the original code followed by what I added. Thanks for the help!

Original:

while(<>) {
    chomp;
    s/\<[^<>]*\>//g;           # eliminate markup
    tr/[A-Z]/[a-z]/;           # downcase

     s/([a-z]+|[^a-z]+)/\1 /g;  # separate letter strings from other types of sequences

    s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

    s/[0-9]+/\#/g;             # map numerical strings to #

    s/\s+/ /g;                 # these three lines clean up white space (so it's always exactly one space between words, no newlines
    s/^\s+//;
    s/\s+$/ /;


    print if(m/\S/);           # print what's left
}
print "\n"; # final newline, so whole doc is on one line that ends in newline

My Changes:

#!/usr/local/bin/perl

$dirtoget="1999_txt/";
opendir(IMD, $dirtoget) || die("Cannot open directory");
@thefiles= readdir(IMD); #
closedir(IMD);
    foreach $f (@thefiles)
    {
        unless ( ($f eq ".") || ($f eq "..") )
        {
            $fr="$dirtoget$f";
            open(FILEREAD, "< $fr");

$x="";
while($line = <FILEREAD>) { $x .= $line; } # read the whole file into one string
close FILEREAD;

print "$x/n";   
while(<$x>) {
    chomp;
    s/\<[^<>]*\>//g;           # eliminate markup
    tr/[A-Z]/[a-z]/;           # downcase

    s/([a-z]+|[^a-z]+)/\1 /g;  # separate letter strings from other types of sequences

    s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

    s/[0-9]+/\#/g;             # map numerical strings to #

    s/\s+/ /g;                 # these three lines clean up white space (so it's always exactly one space between words, no newlines
    s/^\s+//;
    s/\s+$/ /;


    print if(m/\S/);           # print what's left
}
print "\n"; # final newline, so whole doc is on one line that ends in newline

}}

It looks like your old code was run like this: `$ perl script.pl inputfile.txt >outputfile.txt` because it reads the file whose name has been supplied as an argument and `print`s the results. So it would make sense to send that output to a new file. If you wanted it to go through more files, write a wrapper in either Perl, or maybe bash. — simbabque, Jun 30 '15 at 11:36
http://stackoverflow.com/questions/8023959/why-use-strict-and-warnings — TLP, Jun 30 '15 at 11:56
Instead of all these complicated replacements, why not just capture letters (and numbers if you want the `###`)? E.g. `my @words = $document =~ /[a-z]+|\d+/gi` — TLP, Jun 30 '15 at 12:15

nowox · Answer 1 · 2015-06-30T11:49:37.610

You don't really need to edit the original script to apply it to the content of a directory. The shell will be your friend in this case.

Your first script reads every file passed as an argument or, by default, the content of stdin. In other words, you can call your original script like this:

$ ./script file > output
$ cat file | ./script | less
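
The same calling convention can be tried without the Perl script at hand. This is only a sketch: `lowercase` is a hypothetical stand-in for `./script` that, like a `while(<>)` loop, reads the files named as arguments, or standard input when none are given. The file names are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical stand-in for ./script: like Perl's while(<>), cat "$@" reads
# the files named as arguments, or standard input when there are none.
lowercase() { cat "$@" | tr 'A-Z' 'a-z'; }

tmp=$(mktemp -d)                       # scratch directory for the demonstration
printf 'Hello\n' > "$tmp/a.txt"
printf 'World\n' > "$tmp/b.txt"

lowercase "$tmp/a.txt"                 # one file as an argument
lowercase "$tmp/a.txt" "$tmp/b.txt"    # several files, read in order
lowercase < "$tmp/b.txt"               # no argument: reads stdin
```

Passing several files to one invocation produces a single combined output, which is why a per-file loop is needed when each input should get its own output file.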

If you want to parse all the files you can still use your shell:

$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"

It might be clearer with this short example:

Consider a script similar to yours, named script:

#!/usr/bin/perl
while(<>) {
   chomp;
   print ">$_<\n";
}
print "\n";

Now, from your shell, you can do:

$ mkdir foo && cd foo
$ echo -e "Hello\nYou\nI am A" >> a.txt
$ echo -e "Hello\nYou\nI am A" >> b.txt

$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"

$ ls 
a.txt  a.txt.out  b.txt  b.txt.out  script  script.out
$ cat a.txt.out
>Hello<
>You<
>I am A<

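Note that `ls | xargs` also feeds `script` to itself (hence `script.out` in the listing), and a file name containing spaces is pasted unquoted into the `sh -c` command, where it is split into several words. A plain `for` loop avoids both problems. The following is only a sketch: `tokenize` is a hypothetical stand-in for `./script`, and the two scratch directories stand in for an input directory such as `1999_txt` and a separate output directory.

```shell
#!/bin/sh
# Hypothetical stand-in for ./script so that the sketch runs on its own.
tokenize() { tr 'A-Z' 'a-z' < "$1"; }

indir=$(mktemp -d)     # stands in for the directory of input files
outdir=$(mktemp -d)    # results go elsewhere, so they are never re-read as input
printf 'Hello\nYou\n' > "$indir/a.txt"
printf 'I am A\n'     > "$indir/my notes.txt"   # name with a space, on purpose

for f in "$indir"/*; do
    [ -f "$f" ] || continue                     # skip subdirectories
    tokenize "$f" > "$outdir/$(basename "$f").out"
done

ls "$outdir"
```

With the real script, the call inside the loop would be something like `./script "$f"` in place of `tokenize "$f"`; the quoting around `"$f"` is what keeps names with spaces intact.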
@Borodin, I thought the question was how to simply apply the original script to each file in a specific folder. My claim is that he does not need to modify the script, only to wrap the original Perl program in a shell command. Am I wrong? — nowox, Jun 30 '15 at 14:42

score 1 · Answer 2 · answered Jun 30 '15 at 12:26

Your main problem is that you are opening each file and reading its contents into $x, and then passing $x as a file handle to the original loop. But it's not a file handle -- it's just plain text. If you omit the reading of the file into $x and loop over the FILEREAD handle directly, then your code is close to working.

I think this will do as you ask. It uses glob in preference to opendir/readdir because it is more concise.

#!/usr/local/bin/perl

use strict;
use warnings;

while ( my $file = glob '1999_txt/*' ) {

    next unless -f $file;

    open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};

    while ( <$fh> ) {
        chomp;

        s/<[^<>]*>//g;             # Remove HTML tags
        tr/A-Z/a-z/;               # downcase

        s/([a-z]+|[^a-z]+)/$1 /g;  # separate letter strings from other types of sequences

        s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

        s/[0-9]+/#/g;              # map numerical strings to #

        s/\s+/ /g;                 # these three lines clean up whitespace
        s/^\s+//;                  # so it's always exactly one space
        s/\s+$/ /;                 # between words, even across line breaks

        print if /\S/;             # print what's left if it's not just whitespace
    }

    print "\n"; # final newline, so whole doc is on one line that ends in newline
}
