I want to run a script on each line of an input file/stream in bash. This is simple when the lines are short:

while read -r line; do echo "$line" | ./process.sh; done < input.txt

However, the lines are very long (several MiB), so I cannot use read any more. This works:

split -l1 < input.txt
for file in x??; do ./process.sh < "$file"; done
rm x??

but it's very slow as it creates temporary files.

Is there a way to pipe the input lines directly to process.sh so that the script is invoked once per line?

Christoph Walesch

4 Answers

Given this input file:

$ cat file
foo bar
etc

Try this with GNU xargs (using awk '{print "<"$0">"}' as ./process.sh):

$ < file xargs -I {} -d'\n' printf '%s\n' '{}' | awk '{print "<"$0">"}'
<foo bar>
<etc>

or this otherwise:

$ tr '\n' '\0' < file | xargs -I {} -0 printf '%s\n' '{}' | awk '{print "<"$0">"}'
<foo bar>
<etc>

See https://stackoverflow.com/a/28806991/1745001 for an explanation.
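
If ./process.sh has to read each line from standard input instead of receiving it as an argument, a possible adaptation along the same lines (a sketch, not from the commands above) is:

$ < file xargs -d'\n' -n1 sh -c 'printf "%s\n" "$1" | ./process.sh' sh

Note that xargs passes each line to the command as an argument, so lines of several MiB may exceed the kernel's per-argument limit (roughly 128 KiB on Linux).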

Ed Morton

GNU parallel, unlike xargs, can send the data to the standard input of the command instead of passing it as arguments. So you could try:

parallel -j1 -N1 --pipe ./process.sh < input.txt

-j1 for one job at a time only. Use -jN to run N jobs in parallel, or -j+0 to run as many jobs as you have cores on your computer.

-N1 to pass one line per job.

--pipe to send the data to the standard input of the command instead of passing it as command arguments.
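
As a quick check, with wc -c standing in for ./process.sh, each line (newline included) should reach a separate invocation:

$ printf 'foo bar\netc\n' | parallel -j1 -N1 --pipe wc -c
8
4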

Renaud Pacalet

Try GNU sed and the e flag of the s command, which executes the resulting pattern space as a shell command:

$ cat file
one
two
tre

$ sed 's/.*/echo "|&|"/e' file
|one|
|two|
|tre|

In your case it would be something like:

$ sed 's|.*|./process.sh "&"|e' file

Note, though, that the line is spliced into the generated command as an argument, so lines containing double quotes or other shell metacharacters would break it, and multi-MiB lines may exceed argument-length limits.

Awk can also run commands, but I bet awk alone can do what you need.
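
For instance, a minimal sketch that starts a fresh ./process.sh for every line by closing the pipe after each write:

$ awk '{ print | "./process.sh"; close("./process.sh") }' input.txt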

Ivan

I've tested the while read ... code in the question with input lines of 8MB and it works fine. I assume your issue is that it is too slow.

If you can use Perl then this code shows one way to do it:

perl -Mautodie -nle "open P, '| ./process.sh'; print P; close P" input.txt
  • The -Mautodie causes the code to fail with error messages if any file or pipe operations fail. Although the autodie module has been a core module in Perl for over a decade, it may not be available in some Perl installations. Some are very old. Others (for unknown reasons) don't include all core modules. The code will work if you remove -Mautodie, but it may fail silently if something goes wrong.
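
A rough equivalent without -Mautodie, checking each operation explicitly (a sketch along the same lines):

perl -nle 'open(my $p, "|-", "./process.sh") or die "open: $!"; print $p $_; close($p) or die "close: $!"' input.txt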

If Python is an option, this code may be of use:

IFS= read -r -d '' python_code <<'_END_PYTHON_CODE_'
import sys
from subprocess import Popen, PIPE

for line in sys.stdin:
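    # Start a fresh instance of the command for every input line and
    # feed the line to its standard input; leaving the 'with' block
    # closes the pipe and waits for the process to exit.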
    with Popen(sys.argv[1:], stdin=PIPE, text=True) as p:
        p.stdin.write(line)
_END_PYTHON_CODE_

python -c "$python_code" ./process.sh <input.txt
  • I don't have much experience with Python, so the code may not be ideal. It runs slightly slower than the Perl code in my testing.
pjh