
A Perl program uses IPC::Run to pipe a file through a series of commands determined at runtime and into another file, as this small test excerpt demonstrates:

#!/usr/bin/perl
use IO::File;
use IPC::Run qw(run);

open (my $in, '<', 'test.txt') or die "test.txt: $!";
my $out = IO::File->new_tmpfile;

my @args = ( [ split / /, shift ], "<", $in); # this code
while ($#ARGV >= 0) {                         # extracted
    push @args, "|", [ split / /, shift ];    # verbatim
}                                             # from the
push @args, ">pipe", $out;                    # program

print "Running...";
run @args or die "command failed ($?)";
print "Done\n";

It builds the pipeline from commands given as arguments; the test file is hard-coded. The problem is that the pipeline hangs if the file is bigger than 64 KiB. Here is a demonstration that uses cat in the pipeline to keep things simple. First, a 64 KiB (65536-byte) file works as expected:

$ dd if=/dev/urandom of=test.txt bs=1 count=65536
65536 bytes (66 kB, 64 KiB) copied, 0.16437 s, 399 kB/s
$ ./test.pl cat
Running...Done

Next, one byte more. The call to run never returns...

$ dd if=/dev/urandom of=test.txt bs=1 count=65537
65537 bytes (66 kB, 64 KiB) copied, 0.151517 s, 433 kB/s
$ ./test.pl cat
Running...

With IPCRUNDEBUG enabled, plus a few more cats in the pipeline, you can see that it's the last child that doesn't exit:

$ IPCRUNDEBUG=basic ./test.pl cat cat cat cat
Running...
...
IPC::Run 0000 [#1(3543608)]: kid 1 (3543609) exited
IPC::Run 0000 [#1(3543608)]: 3543609 returned 0
IPC::Run 0000 [#1(3543608)]: kid 2 (3543610) exited
IPC::Run 0000 [#1(3543608)]: 3543610 returned 0
IPC::Run 0000 [#1(3543608)]: kid 3 (3543611) exited
IPC::Run 0000 [#1(3543608)]: 3543611 returned 0

(with a file under 64 KiB you see all four exit normally)

How can this be made to work for files of any size?

(Perl 5, version 30, subversion 3 (v5.30.3) built for x86_64-linux-thread-multi, tried on Alpine Linux, the target platform, and Arch Linux to rule out Alpine as a cause)

starfry
  • See also [Deadlocks due to buffering. How does it work?](https://stackoverflow.com/q/40189625/2173773) – Håkon Hægland Jul 05 '20 at 07:55
  • 1
    Tip: Replace `use IO::File; my $out = IO::File->new_tmpfile;` (which create a file) with `use Symbol qw( gensym ); my $out = gensym;` (which creates an anonymous glob). There's no point in creating a file only to close it immediately afterwards! – ikegami Jul 05 '20 at 09:47
  • @ikegami the file in the real program is not closed, but the program goes on to use it. The above is a minimal example. And Håkon Hægland I read about pipe buffering but I don't understand how that applies when the pipe empties into a file? – starfry Jul 05 '20 at 09:49
  • 1
    @starfry, It's closed by `run` when it replaces the file with a pipe. – ikegami Jul 05 '20 at 09:52
  • Re "*when the pipe empties into a file?*", Pipes don't empty into anything. Pipes are read/emptied by having a process read from them. – ikegami Jul 05 '20 at 10:07

2 Answers


You have a deadlock:

Diagram of the deadlock

Consider using one of the following instead:

run [ 'cat' ], '<', $in_fh, '>', \my $captured;

# Do something with the captured output in $captured.

or

my $receiver = sub {
    # Do something with the chunk in $_[0].
};

run [ 'cat' ], '<', $in_fh, '>', $receiver;

For example, the following "receiver" processes each line as it comes in:

my $buffer = '';
my $receiver = sub {
    $buffer .= $_[0];
    while ($buffer =~ s/^(.*)\n//) {
        process_line("$1");
    }
};

run [ 'cat' ], '<', $in_fh, '>', $receiver;

die("Received partial line") if length($buffer);
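If the output really is destined for a file, the comments below note that run can write there directly as well: pass the open handle (or just a filename) with '>' and IPC::Run itself pumps the child's output, so nothing blocks regardless of size. A minimal sketch of that route (my example, not part of the original answer; it generates its own 70000-byte test.txt so it can run standalone):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IPC::Run qw(run);

# Generate an input file larger than the 64 KiB pipe capacity.
open my $gen, '>', 'test.txt' or die "test.txt: $!";
print $gen 'x' x 70000;
close $gen;

open my $in,  '<', 'test.txt' or die "test.txt: $!";
open my $out, '>', 'out.txt'  or die "out.txt: $!";

# '>' (not '>pipe'): IPC::Run does the reading and writing itself,
# so the pipeline cannot stall on an unread pipe.
run [ 'cat' ], '<', $in, '>', $out or die "command failed ($?)";
close $out;

print "copied ", -s 'out.txt', " bytes\n";    # copied 70000 bytes
```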
ikegami
  • Thanks. As a first-time user of `IPC::Run`, I thought from [reading about it](https://metacpan.org/pod/IPC::Run) that I could just use a file handle for the output, and assumed that this meant `run` would deliver to the file and handle any blocking. Lulled into a false sense of security when it worked with small tests. I used your suggested subroutine method and my program now works as it should. – starfry Jul 05 '20 at 12:41
  • 1
    Oh, it can definitely send the output to a file handle or a file without problem (e.g. by using `'>', $file_handle` or `'>', $qualified_file_name`), but you asked it to create a pipe you can read from and to send the output to that pipe. It's obviously not going to read from that pipe cause that would defy the purpose of asking for a pipe! – ikegami Jul 05 '20 at 12:57
  • @ikegami I understand there is a difference between `>pipe` and `>` if the size of the file exceeds 64 KiB. But I do not understand exactly why `>pipe` cannot work the same way as `>`. Is it because a pipe is allocated in memory and the `>` is allocated on disk? – Håkon Hægland Jul 05 '20 at 15:44
  • I ended up here because I originally used `>` to a file handle but it didn't work. I switched to `>pipe` and got it working (subject to the 64K issue). I've now tried `>` to a file handle once again and it works. I don't know why it didn't before - probably because of a silly mistake I must have made, because Perl is not my usual. – starfry Jul 05 '20 at 18:34
  • 1
    No, it never worked with `>pipe`. Nothing was ever written to the file you opened. As previously mentioned, the file handle was replaced with one to a pipe as you requested, closing the file before it was ever used. – ikegami Jul 05 '20 at 19:22
  • @ikegami *"No, it never worked with >pipe"* Got it! So for the `>pipe` to be useful, e.g. like in the [example](https://metacpan.org/pod/IPC::Run#Redirection-Operators) starting with *"The >pipe operator is different in concept than the other..."* it would require that the program to read from the `OUT` handle (see the example) after manually calling `select()` (but after the call to `start()`) to check if it is ready. This must be done before calling `finish()` in order to empty the pipe to avoid deadlock? – Håkon Hægland Jul 05 '20 at 20:16

Here is an example that does not deadlock but still uses the >pipe output handle. I would not recommend this complicated approach for your use case; instead, consider the approach suggested by @ikegami.

The problem is that the >pipe handle is never read from. cat tries to write to the >pipe handle, but since nothing reads from it, the pipe fills up, and the cat process blocks once the pipe holds 64 KiB of data, the default capacity of a pipe on Linux. Now the IPC::Run::finish() call in the parent waits for the child cat process to exit, while at the same time cat waits for the parent to read from its pipe, so we have a deadlock.
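That 64 KiB figure can be checked directly: Linux exposes a pipe's capacity through fcntl(F_GETPIPE_SZ). A small sketch (the constant 1032 is the Linux value of F_GETPIPE_SZ, hard-coded here because not every Fcntl build exports it):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Linux-specific fcntl command to query a pipe's capacity.
# Hard-coded because older Fcntl modules do not export it.
use constant F_GETPIPE_SZ => 1032;

pipe my $reader, my $writer or die "pipe: $!";
my $capacity = fcntl $writer, F_GETPIPE_SZ, 0 or die "fcntl: $!";
print "pipe capacity: $capacity bytes\n";    # typically 65536 (64 KiB)
```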

To avoid this situation, we can use IPC::Run::start() instead of IPC::Run::run():

use feature qw(say);
use strict;
use warnings;
use constant READ_BUF_SIZE => 8192;

use Errno qw( EAGAIN );
use IO::Select;
use IPC::Run qw();
use Symbol 'gensym';

my $outfile = 'out.txt';
open (my $out, '>', $outfile) or die "Could not open file '$outfile': $!";
my $h = IPC::Run::start ['cat'], '<', 'test.txt', '>pipe', my $pipeout = gensym;
my $select = IO::Select->new( $pipeout );
my $data = '';
my $read_offset = 0;
while (1) {
    my @ready = $select->can_read;
    last if !@ready;
    for my $fh (@ready) {
        my $bytes_read = sysread $fh, $data, READ_BUF_SIZE, $read_offset;
        if ( !defined $bytes_read ) {
            die "sysread failed: $!" if $! != EAGAIN;
            $bytes_read = 0;
        }
        elsif ( $bytes_read == 0 ) {
            say "Removing pipe handle from select loop";
            $select->remove( $fh );
            close $fh;
        }
        else {
            say "Read $bytes_read bytes..";
        }
        $read_offset += $bytes_read;
    }
}
say "Saving data to file..";
print $out $data;  #Save data to file
close $out;
say "Finishing harness..";
IPC::Run::finish $h or die "cat returned $?";
say "Done.";
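For the record, the same select loop can stream each chunk straight into the output file instead of accumulating everything in $data first, which keeps memory use bounded for arbitrarily large inputs. A sketch of that variant (my restructuring of the code above; it generates its own 70000-byte test.txt so it runs standalone):

```perl
use strict;
use warnings;
use Errno qw( EAGAIN );
use IO::Select;
use IPC::Run qw();
use Symbol 'gensym';

# Generate an input file larger than the 64 KiB pipe capacity.
open my $gen, '>', 'test.txt' or die "test.txt: $!";
print $gen 'x' x 70000;
close $gen;

open my $out, '>', 'out.txt' or die "out.txt: $!";
my $h = IPC::Run::start ['cat'], '<', 'test.txt', '>pipe', my $pipeout = gensym;
my $select = IO::Select->new( $pipeout );

# can_read returns an empty list once the last handle has been removed.
while ( my @ready = $select->can_read ) {
    for my $fh (@ready) {
        my $n = sysread $fh, my $chunk, 8192;
        if ( !defined $n ) {
            die "sysread failed: $!" if $! != EAGAIN;
            next;
        }
        if ( $n == 0 ) {              # EOF: cat closed its end of the pipe
            $select->remove( $fh );
            close $fh;
            next;
        }
        print {$out} $chunk;          # stream straight to the file
    }
}
close $out;
IPC::Run::finish $h or die "cat returned $?";
print "wrote ", -s 'out.txt', " bytes\n";    # wrote 70000 bytes
```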
Håkon Hægland