2

Two Perl scripts, using different input record separators, work together to convert a LaTeX file into something easily searched for human-readable phrases and sentences. Of course, they could be wrapped together by a single shell script. But I am curious whether they can be incorporated into a single Perl script.

The reason for these scripts: It would be a hassle to find "two three" inside short.tex, for instance. But after conversion, grep 'two three' will return the first paragraph.

For any LaTeX file (here, short.tex), the scripts are invoked as follows.

cat short.tex | try1.pl | try2.pl

try1.pl works on paragraphs. It gets rid of LaTeX comments. It makes sure that each word is separated from its neighbors by a single space, so that no sneaky tabs, form feeds, etc., lurk between words. The resulting paragraph occupies a single line, consisting of visible characters separated by single spaces --- and at the end, a sequence of at least two newlines.

try2.pl slurps the entire file. It makes sure that paragraphs are separated from each other by exactly two newlines. And it ensures that the last line of the file is non-trivial, containing visible character(s).

Can one elegantly concatenate two operations such as these, which depend on different input record separators, into a single Perl script, say big.pl? For instance, could the work of try1.pl and try2.pl be accomplished by two functions or bracketed segments inside the larger script?

Incidentally, is there a Stack Overflow keyword for "input record separator"?

###File try1.pl:

#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = ""; # input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
while (<>) {
    s/[\x25].*\n/ /g; # remove all LaTeX comments. They start with %
    s/[\t\f\r ]+/ /g; # collapse each "run" of whitespace to one single space
    s/^\s*\n/\n/g; # any line that looks blank is converted to a pure newline;
    s/(.)\n/$1/g; # Any line that does not look blank is joined to the subsequent line
    print;
    print "\n\n"; # make sure each paragraph is separated from its fellows by newlines
}

###File try2.pl:

#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = undef; # input record separator: entire text or file is a single record.
while (<>) {
    s/[\n][\n]+/\n\n/g;    # exactly 2 newlines (one blank line) separate paragraphs. Like cat -s
    s/[\n]+$/\n/; # last line is nontrivial; no blank line at the end
    print;
}

###File short.tex:

\paragraph{One}
% comment
two % also 2
three % or 3

% comment
% comment

% comment
% comment

% comment

% comment

So they said%
that they had done it.

% comment
% comment
% comment





Fleas.

% comment

% comment




After conversion:

\paragraph{One} two three

So they said that they had done it.

Fleas.
brian d foy
Jacob Wegelin
  • Please explain what the first script is supposed to do. Some of the comments are wrong, e.g. `s/^\s*\n/\n/g; # collapse each all-whitespace line to a single newline` is not what it does. – melpomene Mar 30 '19 at 12:56

3 Answers

1

To combine try1.pl and try2.pl into a single script you could try:

local $/ = "";
my @lines;
while (<>) {
    [...]    # Same code as in try1.pl except print statements
    push @lines, $_;
}

$lines[-1] =~ s/\n+$/\n/;
print for @lines;
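
As a variation on the idea above, here is a minimal self-contained sketch with the substitutions from try1.pl filled in. It is an illustration under assumptions, not the answerer's exact code: the `$sample` string is made up to stand in for short.tex, input comes from an in-memory filehandle rather than `<>`, and the paragraphs are accumulated in one string (`$text`) instead of an array so that the try2.pl substitutions can be applied directly.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for short.tex.
my $sample = "one % note\ntwo\tthree\n\n\n% comment\n\n\nFleas.\n\n% comment\n\n\n";

open my $in, '<', \$sample or die "Cannot open in-memory handle: $!";

my $text = '';
{
    local $/ = "";               # paragraph mode, only inside this block
    while (<$in>) {
        s/%.*\n/ /g;             # same substitutions as try1.pl
        s/[\t\f\r ]+/ /g;
        s/^\s*\n/\n/g;
        s/(.)\n/$1/g;
        $text .= "$_\n\n";       # what try1.pl would have printed
    }
}

# The whole output is already one string, so the try2.pl step needs no loop.
$text =~ s/\n\n+/\n\n/g;         # one blank line between paragraphs
$text =~ s/\n+$/\n/;             # no blank line at the end

print $text;
```

Because slurp mode only ever produces a single record, operating on the accumulated string has the same effect as a second `while` loop with `$/` set to `undef`.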
Håkon Hægland
  • This captures the output of the try1.pl loop into the lines variable. But what do the last two lines of code accomplish? Inside the same script, is it possible to initiate a new loop with input record separator redefined to undef, to emulate the behavior of try2.pl? In general, is there a way to run two successive while(<>) loops in a single perl script, the result of the first loop feeding into the 2nd? – Jacob Wegelin Mar 30 '19 at 21:13
  • You could use [pipes](https://perldoc.perl.org/perlipc.html#Bidirectional-Communication-with-Another-Process) but that will involve creating a new process. This is similar to pipes in Bash which also involves sub processes. Also, another alternative: if you transfer your second loop to a script (which you already have in `try2.pl`) you could use IPC modules like [IPC::Run3](https://metacpan.org/pod/IPC::Run3) to pipe output from your script to the other script. – Håkon Hægland Mar 30 '19 at 22:08
  • Still, a much simpler approach (which does not require creating new processes) is to let the first loop write to a [string buffer](https://perldoc.perl.org/functions/open.html) and read from that string buffer in the second loop. – Håkon Hægland Mar 30 '19 at 22:08
  • *"But what do the last two lines of code accomplish?*" They implement the effect of running `try2.pl` on the output from `try1.pl` – Håkon Hægland Mar 30 '19 at 22:11
  • Heh, I didn't read your comments until I finished most of my answer. I showed most of those things. – brian d foy Jan 02 '21 at 18:06
1

A pipe connects the output of one process to the input of another process. Neither one knows about the other nor cares how it operates.

But, putting things together like this breaks the Unix pipeline philosophy of small tools that each excel at a very narrow job. Should you link these two things, you'll always have to do both tasks even if you want one (although you could get into configuration to turn off one, but that's a lot of work).

I process a lot of LaTeX, and I control everything through a Makefile. I don't really care about what the commands look like and I don't even have to remember what they are:

short-clean.tex: short.tex
    cat short.tex | try1.pl | try2.pl > $@

Let's do it anyways

I'll limit myself to the constraint of basic concatenation instead of complete rewriting or rearranging, mostly because there are some interesting things to show.

Consider what happens should you concatenate those two programs by simply adding the text of the second program at the end of the text of the first program.

  • The output from the original first program still goes to standard output and the second program now doesn't get that output as input.

  • The input to the program is likely exhausted by the original first program and the second program now has nothing to read. That's fine because it would have read the unprocessed input to the first program.

There are various ways to fix this, but none of them make much sense when you already have two working programs that do their job. I'd shove that in the Makefile and forget about it.

But, suppose you do want it all in one file.

  • Rewrite the first section to send its output to a filehandle connected to a string. Its output is now in the program's memory. This basically uses the same interface, and you can even use select to make that the default filehandle.

  • Rewrite the second section to read from a filehandle connected to that string.
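
The two steps above, including the `select` trick, can be sketched roughly as follows. This is only an illustration: the variable names and the sample text are made up, and the "second section" here simply uppercases what it reads.

```perl
use strict;
use warnings;

my $buffer = '';
open my $buffer_fh, '>', \$buffer or die "Cannot open in-memory handle: $!";

# select returns the previously selected handle (normally STDOUT).
my $previous_fh = select $buffer_fh;
print "first section output\n";    # a bare print now goes into $buffer
select $previous_fh;               # restore the old default handle
close $buffer_fh;

# Second section: read the captured text back, line by line.
open my $input_fh, '<', \$buffer or die "Cannot reopen buffer: $!";
my @seen;
while ( my $line = <$input_fh> ) {
    push @seen, uc $line;
}
close $input_fh;

print @seen;
```

With `select`, the first section's `print` statements need no changes at all, which is the appeal; the cost is that forgetting to restore the previous handle silently redirects every later bare `print`.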

Alternately, you can do the same thing by writing to a temporary file in the first part, then reading that temporary file in the second part.
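
For the temporary-file route, the core File::Temp module can pick the file name and clean up afterwards. A minimal sketch, with made-up data in place of the real first part:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# tempfile() returns an open handle and the file's name;
# UNLINK => 1 asks File::Temp to delete the file when the program exits.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );

# First part: write intermediate results to the temporary file.
print {$tmp_fh} "intermediate line $_\n" for 1 .. 3;
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Second part: read the intermediate results back.
open my $in_fh, '<', $tmp_name or die "Cannot open $tmp_name: $!";
my @lines = <$in_fh>;
close $in_fh;

print scalar(@lines), " lines read back\n";
```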

A much more sophisticated program would have the first program write to a pipe (inside the program) that the second program is simultaneously reading. However, you have to pretty much rewrite everything so the two programs are happening simultaneously.

Here's Program 1, which uppercases most of the letters:

#!/usr/bin/perl
use v5.26;
$|++;
while( <<>> ) { # safer line input operator
    print tr/a-z/A-Z/r;
    }

and here's Program 2, which collapses whitespace:

#!/usr/bin/perl
use v5.26;
$|++;
while( <<>> ) { # safer line input operator
    print s/\h+/ /gr; # \h matches horizontal whitespace only, so the newline survives
    }

They work serially to get the job done:

$ perl program1.pl
The quick brown dog jumped over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D

$ perl program2.pl
The quick     brown dog jumped        over the lazy fox.
The quick brown dog jumped over the lazy fox.
^D

$ perl program1.pl | perl program2.pl
The quick     brown dog jumped        over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D

Now I want to combine those. First, I'll make some changes that don't affect the operation but will make it easier for me later. Instead of using implicit filehandles, I'll make those explicit and one level removed from the actual filehandles:

Program 1:

#!/usr/bin/perl
use v5.26;
$|++;
my $output_fh = \*STDOUT;
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }

Program 2:

#!/usr/bin/perl
$|++;
my $input_fh = \*STDIN;
while( <$input_fh> ) { # read from the explicit input filehandle
    print s/\h+/ /gr;
    }

Now I have the chance to change what those filehandles are without disturbing the meat of the program. The while doesn't know or care what that filehandle is, so let's start by writing to a file in Program 1 and reading from that same file in Program 2:

Program 1:

#!/usr/bin/perl
use v5.26;
open my $output_fh, '>', 'program1.out' or die "$!";
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }
close $output_fh;

Program 2:

#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";
while( <$input_fh> ) { # read from the file Program 1 wrote
    print s/\h+/ /gr;
    }

However, you can no longer run these in a pipeline because Program 1 doesn't use standard output and Program 2 doesn't read standard input:

% perl program1.pl
% perl program2.pl

You can, however, now join the programs, shebang and all:

#!/usr/bin/perl
use v5.26;

open my $output_fh, '>', 'program1.out' or die "$!";
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }
close $output_fh;

#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";
while( <$input_fh> ) { # read from the file the first part wrote
    print s/\h+/ /gr;
    }

You can skip the file and use a string instead, but at this point, you've gone beyond merely concatenating files and need a little coordination for them to share the scalar with the data. Still, the meat of the program doesn't care how you made those filehandles:

#!/usr/bin/perl
use v5.26;

my $output_string;

open my $output_fh, '>', \ $output_string or die "$!";
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }
close $output_fh;

#!/usr/bin/perl
$|++;
open my $input_fh, '<', \ $output_string or die "$!";
while( <$input_fh> ) { # read back from the in-memory string
    print s/\h+/ /gr;
    }
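
Applied to the original question's two record separators, the same in-memory technique might look like the sketch below. The sample input and the simplified substitutions are assumptions for illustration; the point is that each loop gets its own `$/` by wrapping it in a block with `local`.

```perl
use strict;
use warnings;

my $input  = "a\nb\n\n\n\nc\n";    # hypothetical sample: two paragraphs
my $stage1 = '';

open my $in_fh,  '<', \$input  or die "$!";
open my $out_fh, '>', \$stage1 or die "$!";
{
    local $/ = "";                       # paragraph mode for this loop only
    while ( my $para = <$in_fh> ) {
        $para =~ s/\s+/ /g;              # join the paragraph's lines
        $para =~ s/ $//;                 # drop the trailing space
        print {$out_fh} "$para\n\n\n";   # deliberately too many newlines
    }
}
close $out_fh;

open my $stage1_fh, '<', \$stage1 or die "$!";
my $result;
{
    local $/;                            # undef: slurp mode for this loop only
    while ( my $all = <$stage1_fh> ) {
        $all =~ s/\n{2,}/\n\n/g;         # one blank line between paragraphs
        $all =~ s/\n+\z/\n/;             # no blank line at the end
        $result = $all;
    }
}
print $result;
```

When each block ends, `$/` reverts to its previous value, so neither loop can leak its record separator into the other.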

So let's go one step further and do what the shell was already doing for us.

#!/usr/bin/perl
use v5.26;

pipe my $input_fh, my $output_fh;
$output_fh->autoflush(1);

while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }
close $output_fh;

while( <$input_fh> ) { # read from the read end of the pipe
    print s/\h+/ /gr;
    }

From here, it gets a bit tricky and I'm not going to go to the next step with polling filehandles so one thing can write and the next thing reads. There are plenty of things that do that for you. And, you're now doing a lot of work to avoid something that was already simple and working.
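
One common way to get the writer and reader running at the same time, without polling, is to `fork` so that each end of the pipe lives in its own process. This is a hedged sketch rather than a recommendation, with made-up input lines; it matters because writing everything to a pipe before anyone reads it can block once the pipe's buffer fills.

```perl
use strict;
use warnings;

pipe my $read_fh, my $write_fh or die "pipe failed: $!";

my $pid = fork;
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child: plays the role of the first program.
    close $read_fh;
    for my $line ( "The quick   brown\n", "dog  jumped\n" ) {    # made-up input
        print {$write_fh} $line =~ tr/a-z/A-Z/r;
    }
    close $write_fh;
    exit 0;
}

# Parent: plays the role of the second program.
close $write_fh;    # otherwise the read loop never sees end-of-file
my @output;
while ( my $line = <$read_fh> ) {
    push @output, $line =~ s/\h+/ /gr;
}
close $read_fh;
waitpid $pid, 0;

print @output;
```

This is essentially what the shell does for `program1.pl | program2.pl`, which is another argument for leaving the job to the shell.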

Instead of all that pipe nonsense, the next step is to separate code into functions (likely in a library), and deal with those chunks of code as named things that hide their details:

use Local::Util qw(remove_comments minify);

while( <<>> ) {
    my $result = remove_comments($_);
    $result = minify( $result );
    ...
    }

That can get even fancier where you simply go through a series of steps without knowing what they are or how many of them there will be. And, since all the baby steps are separate and independent, you're basically back to the pipeline notion:

use Local::Util qw(get_input remove_comments minify);

my $result;
my @steps = qw(get_input remove_comments minify);
while( ! eof() ) {  # or whatever
    no strict 'refs';
    $result = &{$_}( $result ) for @steps;
    }
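
If the steps live in the same file, code references avoid the symbolic lookup altogether. A minimal sketch, in which `remove_comments` and `minify` are hypothetical stand-ins for whatever Local::Util would export and the input lines are made up:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the routines imported from Local::Util.
sub remove_comments { my ($text) = @_; $text =~ s/(?<!\\)%.*//; return $text }
sub minify          { my ($text) = @_; $text =~ s/\h+/ /g;     return $text }

# A list of code references: no "no strict 'refs'" needed.
my @steps = ( \&remove_comments, \&minify );

my @cleaned;
for my $line ( "two   % also 2\n", "three\tfour\n" ) {    # made-up input
    my $result = $line;
    $result = $_->($result) for @steps;
    push @cleaned, $result;
}
print @cleaned;
```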

A better way makes that an object so you can skip the soft reference:

use Local::Processor;

my @steps = qw(get_input remove_comments minify);
my $processor = Local::Processor->new( @steps );

my $result;
while( ! eof() ) {  # or whatever
    $result = $processor->$_($result) for @steps;
    }
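
To make the object version concrete, here is a self-contained sketch with a tiny inline package standing in for Local::Processor. The method bodies are assumptions for illustration only; the real module could do anything.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Local::Processor, defined inline so the
# sketch runs on its own.
package Local::Processor {
    sub new { my ($class) = @_; return bless {}, $class }
    sub remove_comments { my ( $self, $text ) = @_; $text =~ s/(?<!\\)%.*//; return $text }
    sub minify          { my ( $self, $text ) = @_; $text =~ s/\h+/ /g;     return $text }
}

my @steps     = qw(remove_comments minify);
my $processor = Local::Processor->new;

my $result = "So   they said% note";    # made-up input
for my $step (@steps) {
    # can() guards against a misspelled or unsupported step name.
    die "Unknown step '$step'" unless $processor->can($step);
    $result = $processor->$step($result);
}
print "$result\n";
```

Calling a method through a string held in a scalar is allowed under `strict`, which is why this form needs no `no strict 'refs'`.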

Like I did before, the meat of the program doesn't care or know about the steps ahead of time. That means that you can move the sequence of steps to configuration and use the same program for any combination and sequence:

use Local::Config;
use Local::Processor;

my @steps = Local::Config->new->get_steps;
my $processor = Local::Processor->new;

my $result;
while( ! eof() ) {  # or whatever
    $result = $processor->$_($result) for @steps;
    }

I write quite a bit about this sort of stuff in Mastering Perl and Effective Perl Programming. But, because you can do it doesn't mean you should. This reinvents a lot that make can already do for you. I don't do this sort of thing without good reason—bash and make have to be pretty annoying to motivate me to go this far.

brian d foy
0

The motivating problem was to generate a "cleaned" version of a LaTeX file, which would be easy to search, using regex, for complex phrases or sentences.

The following single Perl script does the job, whereas previously I required one shell script and two Perl scripts, entailing three invocations of Perl. This new, single script incorporates three consecutive loops, each with a different input record separator.

  1. First loop:

    input = STDIN, or a file passed as argument;
    record separator = default, loop by line;
    print result to fileafterperlLIN, a temporary file on the hard drive.

  2. Second loop:

    input = fileafterperlLIN;
    record separator = "", loop by paragraph;
    print result to fileafterperlPRG, a temporary file on the hard drive.

  3. Third loop:

    input = fileafterperlPRG;
    record separator = undef, slurp entire file
    print result to STDOUT

This has the disadvantage of printing to and reading from two files on the hard drive, which may slow it down. Advantages are that the operation seems to require only one process; and all the code resides in a single file, which should make it easier to maintain.

#!/usr/bin/perl
# 2019v04v05vFriv17h18m41s

use strict;
use warnings;
use 5.18.2;

my $diagnose;
my $diagnosticstring;
my $exitcode;
my $userName =  $ENV{'LOGNAME'};
my $scriptpath;
my $scriptname;
my $scriptdirectory;
my $cdld;
my $fileafterperlLIN;
my $fileafterperlPRG;
my $handlefileafterperlLIN;
my $handlefileafterperlPRG;
my $encoding;
my $count;

sub diagnosticmessage {
    return unless ( $diagnose );
    print STDERR "$scriptname: ";
    foreach $diagnosticstring (@_) {
        print STDERR "$diagnosticstring\n"; # print, not printf: the message is data, not a format string
    }
}

# Routine setup
$scriptpath = $0;
$scriptname = $scriptpath;
$scriptname =~ s|.*\x2f([^\x2f]+)$|$1|;
$cdld = "$ENV{'cdld'}"; # A directory to hold temporary files used by scripts
$exitcode = system("test -d $cdld && test -w $cdld || { printf '%s\n' 'cdld not a writeable directory'; exit 1; }");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;

$scriptdirectory = "$cdld/$scriptname"; # To hold temporary files used by this script
$exitcode = system("test -d $scriptdirectory || mkdir $scriptdirectory");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
diagnosticmessage ( "scriptdirectory=$scriptdirectory" );
$exitcode = system("test -w $scriptdirectory && test -x $scriptdirectory || exit 1;");
die "$scriptname: system returned exitcode=$exitcode: $scriptdirectory not writeable or not executable. bail\n" unless $exitcode == 0;
$fileafterperlLIN = "$scriptdirectory/afterperlLIN.tex";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
$exitcode = system("printf '' > $fileafterperlLIN;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$fileafterperlPRG = "$scriptdirectory/afterperlPRG.tex";
diagnosticmessage ( "fileafterperlPRG=$fileafterperlPRG" );
$exitcode=system("printf '' > $fileafterperlPRG;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;

# This script's job: starting with a LaTeX file, which may compile beautifully in pdflatex but be difficult
# to read visually or search automatically,
# (1) convert any line that looks blank --- a "trivial line", containing only whitespace --- to a pure newline. This is because
#     (a) LaTeX interprets any whitespace line following a non-blank or "nontrivial" line as end of paragraph, whereas
#     (b) Perl needs two consecutive newlines to signal end of paragraph.
# (2) remove all LaTeX comments;
# (3) deal with the \unskip LaTeX construct, etc.
# The result will be
# (4) each LaTeX paragraph will occupy a unique line
# (5) exactly one pair of newlines --- visually, one blank line --- will divide each pair of consecutive paragraphs
# (6) first paragraph will be on first line (no opening blank line) and last paragraph will be on last line (no ending blank line)
# (7) whitespace in output will consist of only
#     (a) a single space between readable strings, or
#     (b) double newline between paragraphs
#
$handlefileafterperlLIN = undef;
$handlefileafterperlPRG = undef;
$encoding = ":encoding(UTF-8)";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
open($handlefileafterperlLIN, ">> $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for appending: $!";

# Loop 1 / line:
# Default input record separator: loop through one line at a time, delimited by \n
$count = 0;
while (<>) {
    $count = $count + 1;
    diagnosticmessage ( "line $count" );
    s/^\s*\n/\n/mg; # Convert any trivial line to a pure newline.
    print $handlefileafterperlLIN $_;
}

close($handlefileafterperlLIN);
open($handlefileafterperlLIN, "< $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for reading: $!";
open($handlefileafterperlPRG, ">> $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for appending: $!";

# Loop PRG / paragraph:
local $/ = ""; # Input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
$count = 0;
while (<$handlefileafterperlLIN>) {
    $count = $count + 1;
    diagnosticmessage ( "paragraph $count" );
    s/(?<!\x5c)[\x25].*\n/ /g; # Remove all LaTeX comments.
    #    They start with % not \% and extend to end of line or newline character. Join to next line.
    #    s/(?<!\x5c)([\x24])/\x2a/g; # 2019v04v01vMonv13h44m09s any $ not preceded by backslash \, replace $ by * or something.
    #    This would be only if we are going to run detex on the output.
    s/(.)\n/$1 /g; # Any line that has something other than newline, and then a newline, is joined to the subsequent line
    s|([^\x2d])\s*(\x2d\x2d\x2d)([^\x2d])|$1 $2$3|g; # consistent treatment of triple hyphen as em dash
    s|([^\x2d])(\x2d\x2d\x2d)\s*([^\x2d])|$1$2 $3|g; # consistent treatment of triple hyphen as em dash, continued
    s/[\x0b\x09\x0c\x20]+/ /gm; # collapse each "run" of whitespace other than newline, to a single space.
    s/\s*[\x5c]unskip(\x7b\x7d)?\s*(\S)/$2/g; # LaTeX whitespace-collapse across newlines
    s/^\s*//; # Any nontrivial line: No indenting. No whitespace in first column.
    print $handlefileafterperlPRG $_;
    print $handlefileafterperlPRG "\n\n"; # make sure each paragraph ends with 2 newlines, hence at least 1 blank line.
}
close($handlefileafterperlPRG);

open($handlefileafterperlPRG, "< $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for reading: $!";

# Loop slurp
local $/ = undef;  # Input record separator: entire file is a single record.
$count = 0;
while (<$handlefileafterperlPRG>) {
    $count = $count + 1;
    diagnosticmessage ( "slurp $count" );
    s/[\n][\n]+/\n\n/g;  # Exactly 2 newlines (one blank line) separate paragraphs. Like cat -s
    s/[\n]+$/\n/;        # Last line is visible or "nontrivial"; no trivial (blank) line at the end
    s/^[\n]+//;          # No trivial (blank) line at the start. The first line is "nontrivial."
    print STDOUT;
}
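
The setup section of the script above shells out to test, mkdir, and printf. The same checks could probably be done with Perl's own file tests and built-ins. A rough sketch, assuming a throwaway directory from File::Temp in place of `$ENV{cdld}` so that it runs anywhere:

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;
use File::Temp qw(tempdir);

my $scriptname = basename($0);

# Stand-in for $ENV{cdld}: a throwaway directory removed at exit.
my $cdld = tempdir( CLEANUP => 1 );

die "$scriptname: '$cdld' is not a writable directory\n"
    unless -d $cdld && -w $cdld;

my $scriptdirectory = File::Spec->catdir( $cdld, $scriptname );
unless ( -d $scriptdirectory ) {
    mkdir $scriptdirectory
        or die "$scriptname: cannot create $scriptdirectory: $!\n";
}

# Opening with '>' creates or truncates the file, so no separate
# "printf '' > file" step is needed.
my $fileafterperlLIN = File::Spec->catfile( $scriptdirectory, 'afterperlLIN.tex' );
open my $fh, '>:encoding(UTF-8)', $fileafterperlLIN
    or die "$scriptname: cannot open $fileafterperlLIN: $!\n";
close $fh;

print "created $fileafterperlLIN\n";
```

Since `>` truncates on open, the later `>>` opens in the script could become `>` as well, and the error messages come from `$!` instead of a shell exit code.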
Peter Mortensen
Jacob Wegelin
  • I disagree with the notion that the contents of a single file are easier to maintain. It's more code to look at. I'm much more comfortable having two small programs in a repo. The repo keeps both around and once I setup a build file, I don't think about it again. – brian d foy Jan 02 '21 at 17:48
  • Also, since most of the beginning of the program is a mess of system calls, this is practically crying for a build file that's outside of the Perl program. – brian d foy Jan 02 '21 at 17:51