PERL to count non-printable characters

Question

I have hundreds of thousands of files that I would like to analyze. Specifically, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. The files come from mainframes, Windows, Unix, and so on, so binary and control characters are likely to be present.

I started by using the Linux "file" command, but it did not provide enough detail for my purposes. The following code conveys what I am trying to do, but does not always work.

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print = 0;
    my $cnt_total = 0;
    my $prc_print = 0;

    #Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};

    #Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) {$cnt_print++};

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print/$cnt_total;

    #Print the # total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n"

This is a test call that works:

    echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl

This is how I intend to call it, and works for one file:

    find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

This does not work correctly:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

Neither does this:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl

Instead of executing the script once for EACH line returned by the find, it executes ONCE for ALL the results.

Thanks in advance.

Research so far:

Pipe and XARGS and separators

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem

Clarification(s):
1.) Desired output: If there are 932 files in a directory, the output would be a 932 line list of file names, the total bytes read from the file and the % that were printable characters.
2.) Many of the files are binary. Script needs to handle embedded binary eol or eof sequences.
3.) Many of the files are large, so I would like to read only the first/last xx bytes. I had been trying to use `head -c 256` or `tail -c 128` to read either the first 256 bytes or the last 128 bytes respectively. The solution could either work in a pipeline or limit the bytes within the Perl script.
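Since clarification 3 allows limiting the bytes inside Perl, one option is to let the pipeline carry file names rather than file contents, so that each file still gets its own output line. The following is a minimal sketch under that assumption: names arrive NUL-terminated from `find ... -print0`, and `first_bytes` is a hypothetical helper, not something from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return up to $limit bytes from the start of $path,
# or undef if the file cannot be opened or read.
sub first_bytes {
    my ($path, $limit) = @_;
    open(my $fh, '<:raw', $path) or return undef;
    my $got = read($fh, my $data, $limit);
    close $fh;
    return defined $got ? $data : undef;
}

# Run the main loop only when executed as a script.
unless (caller) {
    local $/ = "\0";                  # one record per NUL-terminated name
    while (my $name = <STDIN>) {
        chomp $name;                  # strips the trailing "\0"
        my $data = first_bytes($name, 2000);
        if (!defined $data) {
            warn "Skipping $name (cannot open or read)\n";
            next;
        }
        my $n_bytes   = length $data;
        my $cnt_print = ($data =~ tr/\x20-\x7E//);   # printable ASCII bytes
        my $prc_print = $n_bytes ? $cnt_print / $n_bytes : 0;
        print "$name|$n_bytes|$prc_print\n";
    }
}
```

Assuming the sketch is saved as `count_first_bytes.pl` (a made-up name), it could be called as `find /fct/inbound/trans/ -type f -print0 | perl count_first_bytes.pl`.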

`while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};` is better done as `$cnt_n_print += ( () = m/[^[:print:]]/g );` (or better yet, using tr///, only that doesn't support POSIX classes) — ysth, Nov 20 '12 at 22:51
"Better" = faster, more concisely, but uses more memory. Possibly quite a lot more, actually. (A whole string scalar per matching character!) — ikegami, Nov 20 '12 at 22:55
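The three counting styles mentioned in these comments can be compared side by side. This is only a sketch: the sub names are made up, and the `tr///` variant assumes the data is raw bytes, so that "printable" means the ASCII range `\x20-\x7E`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# 1. Loop over the matches: no extra memory beyond the counter.
sub count_loop {
    my ($data) = @_;
    my $n = 0;
    ++$n while $data =~ /[^[:print:]]/g;
    return $n;
}

# 2. Match in list context and count the list: concise, but it builds
#    one temporary scalar per matching character.
sub count_list {
    my ($data) = @_;
    my $n = () = $data =~ /[^[:print:]]/g;
    return $n;
}

# 3. tr/// with the /c (complement) flag and an explicit range, because
#    tr/// does not accept POSIX classes. It only counts here; the string
#    is not modified.
sub count_tr {
    my ($data) = @_;
    return ($data =~ tr/\x20-\x7E//c);
}

my $sample = "ab\x00\x01c\n";
print join('|', count_loop($sample), count_list($sample), count_tr($sample)), "\n";
# prints 3|3|3
```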

score 4 · Answer 1 · edited May 23 '17 at 11:52

The `-n` option wraps your entire code in a `while (defined($_ = <ARGV>)) { ... }` block. This means your `my $cnt_print` and the other variable declarations are executed again for every line of input, which resets all of your variable values.

The workaround is to use global variables (declare them with `our` if you want to keep `use strict`) and not to initialize them to 0, since they would be reinitialized for every line of input. You could say something like

    our $cnt_print //= 0;

if you don't want `$cnt_print` and its friends to be undefined for the first line of input.

See this recent question with a similar issue.
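The scoping problem is easier to see with the implicit loop written out. Below is a small sketch, not the code from the question: `count_stream` is a hypothetical helper, and the counters are declared once, before the loop, so they accumulate across lines instead of being reset on each one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: total and non-printable counts for everything
# that can be read from the filehandle $fh.
sub count_stream {
    my ($fh) = @_;
    my ($cnt_total, $cnt_n_print) = (0, 0);   # declared once, outside the loop
    while (my $line = <$fh>) {
        $cnt_total += length $line;
        ++$cnt_n_print while $line =~ /[^[:print:]]/g;
    }
    return ($cnt_total, $cnt_n_print);
}

# Demonstration on an in-memory "file" holding two lines.
my $text = "line one\nline\ttwo\n";
open(my $demo, '<', \$text) or die "Can't open in-memory file: $!";
my ($total, $n_print) = count_stream($demo);
close $demo;
print "$total bytes, $n_print non-printable\n";
# prints "18 bytes, 3 non-printable"
```

The same sub could be given `\*STDIN` to total a whole pipeline's input in one pass.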


answered Nov 20 '12 at 22:33

mob


Thanks for the quick reply...As for the "-n" option, the implied while loop is what I want. If I pass the script 172 files, I want 172 distinct outputs (one for each file). Is there a best practice to use either "-n" or an explicit "while"? – Stan Nov 21 '12 at 14:28

ikegami · Answer 2 · 2012-11-20T22:57:55.293

You could have `find` pass you one arg at a time.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} \;

But I'd continue passing multiple files at a time, either through `xargs`, or using GNU find's `-exec ... +`.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} +

The following code snippets support both.

You can continue reading a line at a time:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $cnt_total   = 0;
    my $cnt_n_print = 0;

    while (<>) {
        $cnt_total += length;
        ++$cnt_n_print while /[^[:print:]]/g;
    } continue {
        # eof without parentheses is true at the end of each input file
        if (eof) {
            my $cnt_print = $cnt_total - $cnt_n_print;
            my $prc_print = $cnt_print/$cnt_total;

            print "$ARGV: $cnt_total|$prc_print\n";

            # reset the counters for the next file
            $cnt_total   = 0;
            $cnt_n_print = 0;
        }
    }

Or you could read a whole file at a time:

    #!/usr/bin/perl

    use strict;
    use warnings;

    local $/;    # slurp mode: each read returns one whole file
    while (<>) {
        my $cnt_total = length;
        next if !$cnt_total;    # skip empty files to avoid dividing by zero

        my $cnt_n_print = 0;
        ++$cnt_n_print while /[^[:print:]]/g;

        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_print/$cnt_total;

        print "$ARGV: $cnt_total|$prc_print\n";
    }

edited Nov 20 '12 at 22:57

answered Nov 20 '12 at 22:51

ikegami


Thanks!!! REALLY close, but I think it chokes on binary files, and I need to read only the first X bytes (see the clarification above). Also, I could only get the GNU `-exec` form to work. Can you help update the script so that it can either work in a Linux pipeline with the head/tail command, like a) `find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl`, or take arguments, something like b) `find /path/to/analyze/ -type f -exec perl script.pl {} first 264 +` or c) `find /path/to/analyze/ -type f -exec perl script.pl {} last 128 +`? – Stan Nov 27 '12 at 16:29
Well, `readline` (`<>`) is not very appropriate for binary files. You'd want `read` instead. Iterate over the files using `for (@ARGV)` and open them yourself. – ikegami Nov 27 '12 at 16:42
Could you include an example that works with the output from a find command? I found this reference but still having problems [Perl File Handling: open, read, write and close files](http://www.perlfect.com/articles/perlfile.shtml) – Stan Nov 27 '12 at 18:22
Sorry, can't help with problems you mention nothing about. – ikegami Nov 28 '12 at 07:41
Thanks for your help. I posted my working solution based on your recommendations. – Stan Nov 28 '12 at 19:17
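The advice in these comments (use `read`, loop over `@ARGV`, open each file yourself) can be combined with the first/last request roughly as follows. This is a sketch, not code from the thread: `sample_bytes` and the `first|last BYTES FILE...` argument order are assumptions made for illustration. The file names come last so that `{} +` can end the `find` command.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Hypothetical helper: read up to $limit bytes from the start of $path,
# or from its end when $where is 'last'. Returns undef on failure.
sub sample_bytes {
    my ($path, $where, $limit) = @_;
    open(my $fh, '<:raw', $path) or return undef;
    if ($where eq 'last') {
        my $size = (-s $fh) || 0;
        my $back = $size < $limit ? $size : $limit;   # never seek before the start
        seek($fh, -$back, SEEK_END) or return undef;
    }
    my $got = read($fh, my $data, $limit);
    close $fh;
    return defined $got ? $data : undef;
}

# Assumed invocation: perl sample.pl first|last BYTES FILE...
my ($where, $limit, @paths) = @ARGV;
for my $path (@paths) {
    my $data = sample_bytes($path, $where, $limit);
    if (!defined $data) {
        warn "Skipping $path (cannot open, seek or read)\n";
        next;
    }
    my $n_bytes   = length $data;
    my $cnt_print = ($data =~ tr/\x20-\x7E//);   # printable ASCII bytes
    my $prc_print = $n_bytes ? $cnt_print / $n_bytes : 0;
    print "$path|$n_bytes|$prc_print\n";
}
```

With the sketch saved under a made-up name such as `sample.pl`, a call could look like `find /path/to/analyze/ -type f -exec perl sample.pl last 128 {} +`.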

Stan · Accepted Answer · 2012-11-28T19:38:48.393

Here is my working solution based on the feedback provided.

I would appreciate any further feedback on form or more efficient methods:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # This program receives one or more file paths on the command line.
    # For each file it attempts to read the first 2000 bytes.
    # The output is one line per file: the file name, the number of bytes
    # actually read, and the fraction of those bytes that are
    # ASCII "printable", i.e. [\x20-\x7E].

    die "Pass at least one file name on the command line.\n" unless @ARGV;

    # loop through each file; do not shift @ARGV inside the loop, because
    # removing elements while foreach iterates makes it skip files
    foreach my $file_name (@ARGV) {

       # open the file read-only ("<") using a lexical filehandle
       open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";

       # binary mode, so that no bytes are translated on read
       binmode $fh;

       # try to read 2000 bytes, save the results in $data and the
       # actual number of bytes read in $n_bytes
       my $n_bytes = read($fh, my $data, 2000);
       defined $n_bytes or die "Can't read $file_name: $!";

       # count the number of non-printable characters
       my $cnt_n_print = 0;
       ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

       my $cnt_print = $n_bytes - $cnt_n_print;

       # an empty file has no bytes, so report 0 instead of dividing by zero
       my $prc_print = $n_bytes ? $cnt_print/$n_bytes : 0;

       print "$file_name|$n_bytes|$prc_print\n";
       close($fh);
    }

Here is a sample of how to call the above script:

    find /some/path/to/files/ -type f -exec perl this_script.pl {} +

Here's a list of references I found helpful:

POSIX Bracket Expressions
Opening files in binmode
Read function
Open file read only

@mob @ikegami In further testing, I am finding that this solution skips some of the files in a directory if called using the above listed `find` command. For example, in one directory there are 39 files, but the script only outputs information on 20. If I run the script on each of the files individually, it also works without error for the 19 skipped by using `find`. Do you have any ideas on how to have the script run for all the files in a directory? — Stan, Nov 29 '12 at 21:56
@mob @ikegami @ysth If I call the Perl script from a batch script, it runs for all the files: `find /some/path/to/files/ -type f -print | while read filename; do perl /path/to/this_script.pl "$filename"; done` What is the "RIGHT" way of doing this? – Stan Nov 30 '12 at 16:01
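The symptom reported in these comments (20 lines of output for 39 files) matches what happens when a `foreach` over `@ARGV` also calls `shift` in its body. Each pass removes one element from the front while the loop's own index moves forward, so the loop stops about halfway. The sketch below reproduces the effect on a plain array; the sub names are made up, and perlsyn warns against changing an array while iterating over it, so the exact behaviour is not something to rely on.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Problem pattern: shift shortens the array on every pass while
# foreach is also advancing its own index through the same array.
sub visited_with_shift {
    my @args = @_;
    my @seen;
    foreach (@args) {
        my $item = shift @args;
        push @seen, $item;
    }
    return @seen;
}

# Safer pattern: let foreach hand over each element and never shift.
sub visited_plain {
    my @args = @_;
    my @seen;
    foreach my $item (@args) {
        push @seen, $item;
    }
    return @seen;
}

my @files = map { "file$_" } 1 .. 39;
my @buggy = visited_with_shift(@files);
my @fixed = visited_plain(@files);
print scalar(@buggy), " of ", scalar(@files), " visited with shift\n";   # 20 of 39
print scalar(@fixed), " of ", scalar(@files), " visited without\n";      # 39 of 39
```

Calling the script once per file, as in the shell loop above, hides the problem because `@ARGV` then holds a single element.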
