
I have a job that reads data from a newline-delimited stream and sends it to xargs to process one line at a time. The problem is that this isn't performant enough, and I know that if the command executed by xargs were given multiple lines at once instead of just one line at a time, it could drastically improve the performance of my script.

Is there a way to do this? I haven't been having any luck with various combinations of -L or -n. Unfortunately, I think I'm also stuck with -I to parameterize the input since my command doesn't seem to want to take stdin if I don't use -I.

The basic idea is that I'm trying to simulate mini-batch processing using xargs.

Conceptually, here's something similar to what I currently have written:

contiguous-stream | xargs -d '\n' -n 10 -L 10 -I {} bash -c 'process_line {}'

^ in the above, process_line is easy to change so that it could process many lines at once, and this function right now is the bottleneck. For emphasis, above, -n 10 and -L 10 don't seem to do anything, my lines are still processing one at a time.
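Here's a toy reproduction of what I'm seeing (assuming GNU xargs; `seq` stands in for my real stream and `echo` for `process_line`, and this deliberately mirrors my `-I {}` pattern even though it's unsafe for untrusted input):

```shell
# With -I given last, each invocation gets exactly one line; the earlier
# -n 10 / -L 10 are overridden (GNU xargs warns about this on stderr).
out=$(seq 1 5 | xargs -d '\n' -n 10 -L 10 -I {} bash -c 'echo "got: {}"' 2>/dev/null)
printf '%s\n' "$out"
```

So despite `-n 10`, this prints five separate `got:` lines, one per input line.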

josiah
    `xargs -I {} bash -c 'something with {}'` is dangerous. What if one of the lines in your stream contains `$(rm -rf ~)`? – Charles Duffy Oct 10 '17 at 00:08
  • Why do you call `bash` to execute `process_line`? If it is a standalone executable you can call it directly to avoid the security problem and some overhead. If it is a shell function or alias you can create a wrapper shell executable to call instead of `bash -c ...`. – pabouk - Ukraine stay strong Oct 10 '17 at 00:21
  • @pabouk, ...agreed that calling an executable directly is preferred, but if the code being invoked is native shell, I'm not sure a wrapper buys anything over `bash -c`, *particularly* if you have an exported function you're trying to invoke. Could you expand? – Charles Duffy Oct 10 '17 at 00:38
  • @CharlesDuffy you are right. It is just an alternative how to avoid the mentioned security issue caused by evaluation of the parameters by shell. – pabouk - Ukraine stay strong Oct 10 '17 at 01:45

1 Answer


Multiple Lines Per Shell Invocation

Don't use -I here. It prevents more than one argument from being passed at a time, and is outright major-security-bug dangerous when being used to substitute values into a string passed as code.

contiguous-stream | xargs -d $'\n' -n 10 \
  bash -c 'for line in "$@"; do process_line "$line"; done' _

Here, we're passing the arguments added by xargs out-of-band from the code, in the positional parameters ($1 and later; the trailing _ fills in $0), and then using "$@" to iterate over them.

Note that this reduces overhead inasmuch as it passes multiple arguments to each shell (so you pay shell startup costs fewer times), but it doesn't actually process all those arguments concurrently. For that, you want...
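As a quick illustration of the batching (a toy sketch: `seq` stands in for `contiguous-stream`, and `echo` for your `process_line`):

```shell
# Each bash invocation receives up to 10 lines in "$@", so shell startup
# cost is paid once per batch rather than once per line.
out=$(seq 1 25 | xargs -d $'\n' -n 10 \
  bash -c 'echo "this shell got $# lines"' _)
printf '%s\n' "$out"
```

With 25 input lines and `-n 10`, only three shells are started: two handling 10 lines each and a final one handling the remaining 5.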

Multiple Lines In Parallel

Assuming GNU xargs, you can use -P to specify a level of parallel processing:

contiguous-stream | xargs -d $'\n' -n 10 -P 8 \
  bash -c 'for line in "$@"; do process_line "$line"; done' _

Here, we're passing 10 arguments to each shell, and running 8 shells at a time. Tune your arguments to taste: Higher values of -n spend less time starting up new shells but increase the amount of waste at the end (if one process still has 8 to go and every other process is done, you're operating suboptimally).
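The same sketch as above, now parallelized (again with `seq` and `echo` as stand-ins; output ordering is nondeterministic with `-P`, so only per-batch counts are meaningful):

```shell
# 100 lines, 10 per batch, up to 8 batches running concurrently.
out=$(seq 1 100 | xargs -d $'\n' -n 10 -P 8 \
  bash -c 'echo "worker $$ processed $# lines"' _)
printf '%s\n' "$out"
```

This starts 10 shells total, each handed exactly 10 lines, with at most 8 alive at once.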

Charles Duffy
  • `-I` doesn't actually prevent more than 1 argument from being passed at a time, but no argument about the security hole. – Joey Coleman Oct 10 '17 at 00:13
  • @JoeyColeman, ...I certainly don't see any clear specification in [the POSIX `xargs` spec](http://pubs.opengroup.org/onlinepubs/009604599/utilities/xargs.html) for how it should behave when `-I` has the implied `-L 1` overridden. Thus, it looks like any ability to work without that is an extension. Granted, I'm taking advantage of `-d`, another extension, but I'd argue a much less ambiguously-specified one. :) – Charles Duffy Oct 10 '17 at 00:18
  • @JoeyColeman, it does seem like `-I` prevents "batching" of parameters. I deleted my answer because Charles' is better, but try `for i in {1..100}; do echo $i; done | tr '\n' '\0' | xargs -0 -n10` and then try with `-I {} echo {}` and you'll see that `-I` definitely affects the xargs param handling. – Ian McGowan Oct 10 '17 at 00:19
  • The other trick that makes this clearer for me is to replace `\n` in the input stream with `\0` using `tr '\n' '\0'` and then use `xargs -0`. – Ian McGowan Oct 10 '17 at 00:23
  • In GNU `xargs` it seems that if you specify `-L x` or `-n x` after `-I {}` then the effect of `-I` is completely cancelled. – pabouk - Ukraine stay strong Oct 10 '17 at 00:24
  • Deleted my answer — `-L` after `-I` disables the `-I` entirely (my use of `echo` in tests masked this) – Joey Coleman Oct 10 '17 at 00:29