I have hundreds of thousands of files that I would like to analyze. Specifically, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. The files come from mainframes, Windows, Unix, etc., so binary and control characters are likely to be included.
I started by using the Linux "file" command, but it did not provide enough detail for my purposes. The following code conveys what I am trying to do, but does not always work.
#!/usr/bin/perl -n
use strict;
use warnings;
my $cnt_n_print = 0;
my $cnt_print = 0;
my $cnt_total = 0;
my $prc_print = 0;
# Count the number of non-printable characters
while ($_ =~ m/[^[:print:]]/g) { $cnt_n_print++ }
# Count the number of printable characters
while ($_ =~ m/[[:print:]]/g) { $cnt_print++ }
$cnt_total = $cnt_n_print + $cnt_print;
$prc_print = $cnt_print / $cnt_total;
# Print the total number of bytes read followed by the % printable
print "$cnt_total|$prc_print\n";
This is a test call that works:
echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl
This is how I intend to call it, and it works for one file:
find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl
This does not work correctly:
find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl
Neither does this:
find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl
Instead of running the script once for EACH file returned by find, the pipeline runs it ONCE over the combined output for ALL the files.
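To make the per-file behaviour I am after concrete, here is a minimal sketch in plain shell. It is not a fix for the Perl script: it swaps the regex counting for tr and wc, and the names sample_printable and NBYTES are hypothetical. It assumes a head that supports -c (as in the calls above) and treats every byte outside the C-locale [:print:] class as non-printable.

```shell
#!/bin/sh
# Sketch only: sample_printable and NBYTES are hypothetical names.
# Prints "file|bytes_read|fraction_printable" for the first N bytes of FILE.
sample_printable() {
    file=$1
    nbytes=$2
    # Bytes actually sampled (fewer than nbytes for short files).
    total=$(head -c "$nbytes" "$file" | wc -c)
    # In the C locale tr works on single bytes; -cd deletes everything
    # that is NOT in [:print:], so wc counts only the printable bytes.
    printable=$(head -c "$nbytes" "$file" | LC_ALL=C tr -cd '[:print:]' | wc -c)
    if [ "$total" -eq 0 ]; then
        # Empty file: avoid dividing by zero.
        printf '%s|0|0\n' "$file"
    else
        frac=$(LC_ALL=C awk -v t="$total" -v p="$printable" 'BEGIN { printf "%.4f", p / t }')
        printf '%s|%s|%s\n' "$file" "$total" "$frac"
    fi
}

# Sample size in bytes; defaults to 2000 as in the calls above.
NBYTES=${NBYTES:-2000}

# One output line per file named on the command line.
for f in "$@"; do
    sample_printable "$f" "$NBYTES"
done
```

Saved under a hypothetical name such as sample_printable.sh, it could be driven by find directly, which hands over the file names as arguments instead of concatenating their contents into one stream: find /fct/inbound/trans/ -type f -exec ./sample_printable.sh {} +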
Thanks in advance.
Research so far:
Pipe and XARGS and separators
http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html
http://en.wikipedia.org/wiki/Xargs#The_separator_problem
Clarification(s):
1.) Desired output: If there are 932 files in a directory, the output would be a 932-line list, one line per file, giving the file name, the total bytes read from the file, and the % of those bytes that were printable characters.
2.) Many of the files are binary. The script needs to handle embedded binary eol or eof sequences.
3.) Many of the files are large, so I would like to read only the first/last xx bytes. I had been trying to use head -c 256 or tail -c 128 to read either the first 256 bytes or the last 128 bytes, respectively. A solution could either work in a pipeline or limit the bytes read within the Perl script.
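As a rough illustration of the sampling in point 3, the choice between the first and last bytes could be wrapped in one small shell helper. The name sample_bytes and its argument order are hypothetical, and it assumes head and tail both accept -c as in the commands above.

```shell
#!/bin/sh
# Sketch only: sample_bytes is a hypothetical helper.
# sample_bytes MODE N FILE writes the first or last N bytes of FILE to stdout.
sample_bytes() {
    case $1 in
        first) head -c "$2" "$3" ;;
        last)  tail -c "$2" "$3" ;;
        *)
            echo "usage: sample_bytes first|last N FILE" >&2
            return 2
            ;;
    esac
}
```

The sampled bytes can then be piped into whatever does the counting, for example: sample_bytes last 128 "$file" | LC_ALL=C tr -cd '[:print:]' | wc -c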