
Very simple question (I think) that I'm surprised I can't seem to find an answer to. So I have the following so far:

£ perl -ne 'print if /ENGPacific Beach\s\s/' 15AM171H0N15000GAJK5 \
| perl -ane 'print "$F[1]|";END{print "\0"}' | xargs -i -0 echo {}
    3346|10989|95459|139670|2239329|3195595|3210017|

So, the first stage of the pipeline is there because the file is 1.5G, and skipping field splitting (-a) on the initial pass greatly speeds things up. The xargs part is to demonstrate what I'm trying to do, which is basically the following:

| xargs -i perl --setperlvar pipeContents={} -ane 'print if $F[3] =~ /$pipeContents/' 15AM171H0N15000GAJK5

1) I know I could use ARGV in a script. I know the whole thing should just be a single script. Let's ignore those bits. My love for -n knows no bounds.

2) Sorry I couldn't find this myself..I'm sure it's incredibly obvious...I did some digging in perldoc and found nothing, though.

3) I'd be interested in a bash/zsh solution that forces the {} to be interpreted by the shell in the middle of the perl ticks as well.

zzxyz
  • I am not sure of all that's involved in your pipeline, but see three ways to pass arguments from bash to Perl one-liner in [this post](https://stackoverflow.com/a/51713578/4653379) – zdim Aug 13 '18 at 20:39
  • Proposal (3) is a source of major security vulnerabilities. Actually-exploited-in-the-real-world security vulnerabilities -- search for "shell injection"; injecting data as code into a perl interpreter is just as bad as injecting it into a shell. Don't do that, *especially* when you're processing 1.5GB of data and can't read through it by hand to find any `"+system("rm -rf ~")+"` substrings. – Charles Duffy Aug 13 '18 at 20:47
  • (For that matter, `xargs` without `-0` or `-d $'\n'` is also considered bad practice in the bash world for some rather good reasons; see [the "Actions in Bulk" section of Using Find](http://mywiki.wooledge.org/UsingFind#Actions_in_bulk:_xargs.2C_-print0_and_-exec_.2B-)). – Charles Duffy Aug 13 '18 at 20:50
  • Is `--setvar` a thing, or something you made up? – that other guy Aug 13 '18 at 20:51
  • --setvar is a thing I made up to demonstrate what I'm trying to do. The data is 100% trusted and this is a "one-shot" program I'm using to search data. @zdim I think proposal 2 answers my question. I don't believe the question itself is a duplicate so if you want to post as answer I'll accept. Any edits to my question to improve searchability are welcome. – zzxyz Aug 13 '18 at 21:17
  • And @CharlesDuffy agree. I was being sloppy and I'll fix. I trust the data (and know how it is formatted) but no reason not to use -0 – zzxyz Aug 13 '18 at 21:23
  • @zzxyz That may be a good idea since the title is crystal clear (while that one is a different business). However, there is more involved here and I am still not quite sure of it all -- do you mean: by `--setvar` to indicate that you want some way to "set a var", which would be the shell variable `pipeContents`, which would be set to the current phrase processed by `xargs` (is that what `{}` do here?) ... ? That's quite a bit rolled into one :) I can't suggest another way since I'm also unsure of the whole context; is `xargs` crucial here or not? – zdim Aug 13 '18 at 21:51
  • Re "*3)*", you mean like `echo abc | xargs -i perl -E'say "<{}>"'`? Well, it "works" (because `xargs` sees two strings, `perl` and `-Esay "<{}>"`, and it replaces all instances of `{}` in those strings), but don't do that. Generating Perl code from the shell is a BAD idea. – ikegami Aug 13 '18 at 22:46
  • I just realized ikegami and Charles Duffy were referring to *my* proposal #3, and not option #3 from zdim's link. Fair enough, and the `-s` option keeps that from being desirable. – zzxyz Aug 13 '18 at 23:37
  • @zdim - Regarding clarifying my question, you may have figured it out, but `xargs -i`. Well, create a directory with 3 files and then do `find . -print0 | xargs -0 echo hello` and `find . -print0 | xargs -0 -i echo {} hello` Two things. It allows you to put the pipe "output" wherever. It has the side-effect of `-n1`, which, for example, has the side-effect of causing perl to output correct file line numbers with `$.` ...and slowing perl WAY down if there are a ton of files. – zzxyz Aug 14 '18 at 00:23
  • thank you for explanation -- yes, I know `xargs` but I wasn't sure whether it was integral to what you want, or used only to "_demonstrate..._". I also wasn't sure what altogether the goal was. But it's all good now, you've got a _really thorough_ analysis from ikegami :) – zdim Aug 14 '18 at 17:27

2 Answers


Two notes before I start:

  • The trailing | in the pattern will cause every line to match. It needs to be removed.
  • /3346|10989|95459|139670|2239329|3195595|3210017/ will match 9993346, so you need to anchor the pattern.

Fixes for these problems are present in all of the following solutions.


You can pass data to a program through

  • Argument list
  • Environment
  • An open file descriptor (e.g. stdin, but fd 3 or higher could also be used) to a pipe
  • External storage (file, database, memcache daemon, etc)

You can still use the argument list. You just need to remove the argument from @ARGV before the loop starts by using BEGIN or avoiding -n.

perl -ne'print if /ENGPacific Beach\s\s/' 15AM171H0N15000GAJK5 |
perl -ane'push @p, $F[1]; END { print join "|", @p; }' |
xargs -i perl -ane'
    BEGIN { $p = shift(@ARGV); }
    print if $F[3] =~ /^(?:$p)\z/;
' {} 15AM171H0N15000GAJK5

Perl also has rudimentary built-in switch parsing, enabled with -s, which you could use instead.

perl -ne'print if /ENGPacific Beach\s\s/' 15AM171H0N15000GAJK5 |
perl -ane'push @p, $F[1]; END { print join "|", @p; }' |
xargs -i perl -sane'print if $F[3] =~ /^(?:$p)\z/' -- -p={} 15AM171H0N15000GAJK5

xargs doesn't seem to have an option to set an environment variable, so taking that approach gets a little complicated.

perl -ne'print if /ENGPacific Beach\s\s/' 15AM171H0N15000GAJK5 |
perl -ane'push @p, $F[1]; END { print join "|", @p; }' |
xargs -i sh -c '
    P="$1" perl -ane'\''print if $F[3] =~ /^(?:$ENV{P})\z/'\'' 15AM171H0N15000GAJK5
' dummy {}

It's weird to involve xargs for a single line. If we avoid xargs, we can turn the above (ugly) command inside out, giving something quite nice.

P="$(
    perl -ne'print if /ENGPacific Beach\s\s/' 15AM171H0N15000GAJK5 |
    perl -ane'push @p, $F[1]; END { print join "|", @p; }'
)" perl -ane'print if $F[3] =~ /^(?:$ENV{P})\z/' 15AM171H0N15000GAJK5

By the way, you don't need a second perl to split only the matching lines.

P="$(
    perl -ne'
       push @p, (split)[1] if /ENGPacific Beach\s\s/;
       END { print join "|", @p; }
    ' 15AM171H0N15000GAJK5
)" perl -ane'print if $F[3] =~ /^(?:$ENV{P})\z/' 15AM171H0N15000GAJK5

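That said, I think interpolating $ENV{P} into the pattern for every line should be avoided to speed things up. The /o modifier makes Perl interpolate and compile the pattern only once.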

P=... perl -ane'print if $F[3] =~ /^(?:$ENV{P})\z/o' 15AM171H0N15000GAJK5

From there, I see two possible speed improvements. (Test to be sure.)

  1. Avoiding splitting entirely in the last perl.

    P=... perl -ne'
       BEGIN { $re = qr/^(?:\S+\s+){3}(?:$ENV{P})\s/o; }
       print if /$re/o;
    ' 15AM171H0N15000GAJK5
    
  2. Avoiding regular expressions entirely in the last perl.

    P=... perl -ane'
       BEGIN { %h = map { $_ => 1 } split /\|/, $ENV{P} }
       print if $h{$F[3]};
    ' 15AM171H0N15000GAJK5
    
ikegami
  • I'm gonna be a day or so processing this answer:) Regarding bulletpoint #2...it's a great point. I really don't even need a regex for the last bit...Just a straight compare or "string begins with". That said, a few extra results are fine. They are going to a terminal, where I'm inspecting them by hand. I'm basically looking at a text dump of a SQL-style database...manually inspecting the data, more or less. – zzxyz Aug 13 '18 at 23:48
  • kudos on the detail in the answer. – Gerhard Aug 14 '18 at 13:34

A handy way to pass arguments is via the -s switch, which enables command-line switches for the program

perl -s -E'say $var' -- -var=value

The -- after the program marks the start of arguments for the program. Then -var introduces a variable $var into the program, with its value supplied after the =; that value is expanded by the shell first. With just -var, the variable $var gets the value 1.

Any such options must come before possible filenames, and they are removed from @ARGV so the program can normally process the submitted files

perl -s -ne'...' -- -var="$SHELL_VAR" filename

where -var={} works, too. In some shells (tcsh for one) it may need to be escaped, \{\}.

However, I also think that it'd be better not to use xargs at all. See ikegami's answer for a very well-rounded discussion of the various approaches, as well as their comment beneath this post for how to avoid xargs while still using -s.

zdim
  • Good idea. Added `-s` to my answer. (heh, `perl -sane`) Still think avoiding `xargs` is the way to go, though. But you can still do that with `-s` by changing `P="$( ... )" perl -ane'...$ENV{P}...' filename` into `perl -sane'...$p...' -- -p="$( ... )" filename` – ikegami Aug 13 '18 at 22:39
  • @ikegami Absolutely agree on `xargs`. (I think that they have it only as a way to show what they want?) I didn't discuss anything else because of your absolutely exhaustive answer. I think I'd still leave this post since it provides details on `-s`. – zdim Aug 14 '18 at 06:21