Why the shell script is slow
The reason it is slow is that for each of the 500 million lines, you are forcing your shell to create 3 processes, so your kernel is hard at work spawning 1.5 billion processes. Suppose it can handle 10 thousand processes a second; you’re still looking at 150 thousand seconds, which is 2 days. And 10k processes per second is fast; probably a factor of ten or more better than you’re getting. On my 2016 15" MacBook Pro running macOS High Sierra 10.13.1, with a 2.7 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3, and 500 GB Flash storage (about 150 GB free), I was getting around 700 processes per second, so the script would nominally take almost 25 days to run through 500 million records.
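If you want to gauge the rate on your own machine, a crude sketch like the following (run in bash; numbers will vary widely) times 1,000 iterations of the same per-line work as the original script — one command-substitution subshell plus two cut processes per iteration — from which you can estimate processes per second:
# Roughly 3,000 process creations in total; divide by the elapsed time for a rate.
time for i in $(seq 1 1000)
do
    x=$(echo "line $i" | cut -f2 | cut -c1-3)
done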
Ways to speed it up
There are ways to make the code faster. You could use plain shell, or Awk, or Python, or Perl. Note that if you use Awk, it needs to be GNU Awk, or at least not BSD (macOS) Awk — the BSD version simply decided it hadn't got enough file descriptors.
I used a random data generator to create a file with 100,000 random entries somewhat similar to those in the question:
E1E583ZUT9 E1E583ZUT9.9 422255 490991884
Z0L339XJB5 Z0L339XJB5.0 852089 601069716
B3U993YMV8 B3U993YMV8.7 257653 443396409
L2F129EXJ4 L2F129EXJ4.8 942989 834728260
R4G123QWR2 R4G123QWR2.6 552467 744905170
K4Z576RKP0 K4Z576RKP0.9 947374 962234282
Z4R862HWX1 Z4R862HWX1.4 909520 2474569
L5D027SCJ5 L5D027SCJ5.4 199652 773936243
R5R272YFB5 R5R272YFB5.4 329247 582852318
G1I128BMI2 G1I128BMI2.6 359124 404495594
(The command used is a home-brew generator that's about to get a rewrite.) The first two columns have the same 10 leading characters in the pattern X#X###XXX# (X for letter, # for digit); the only difference is in the .# suffix. The scripts don't exploit this, and it doesn't matter in the slightest. There's also no guarantee that the values in the second column are unique, nor that the .1 entry appears for a key if the .2 entry appears, and so on. These details are mostly immaterial to the performance measurements. Because of the letter-digit-letter prefix used for the file names, there are 26 * 10 * 26 = 6760 possible file prefixes. With the 100,000 randomly generated records, every one of those prefixes is present.
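The generator itself isn't shown here, but a rough stand-in is easy to write. This bash sketch (not the author's generator; the numeric ranges are guesses) produces records of the same general shape:
# Emit 100,000 records shaped like the sample data: a 10-character X#X###XXX#
# key, the same key with a .# suffix, and two random numeric columns.
letters=ABCDEFGHIJKLMNOPQRSTUVWXYZ
for ((i = 0; i < 100000; i++))
do
    key="${letters:RANDOM%26:1}$((RANDOM%10))${letters:RANDOM%26:1}$((RANDOM%10))$((RANDOM%10))$((RANDOM%10))${letters:RANDOM%26:1}${letters:RANDOM%26:1}${letters:RANDOM%26:1}$((RANDOM%10))"
    printf '%s %s.%d %d %d\n' "$key" "$key" "$((RANDOM%10))" "$(((RANDOM*32768+RANDOM)%900000+100000))" "$((RANDOM*32768+RANDOM))"
done > generated.data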
I wrote a script to time various ways of processing the data. There are 4 shell script variants: the one posted by Lucas A (the OP), two posted by chepner (one of them posted in comments), and one I created. There's also the Awk script created by dawg, a mildly modified version of the Python 3 script posted by chepner, and a Perl script I wrote.
Results
The results can be summarized by this table (run-time measured in seconds of elapsed time or wall clock time):
╔═════════════════╦════╦═════════╦═════════╦═════════╦═════════╗
║ Script Variant  ║  N ║    Mean ║ Std Dev ║     Min ║     Max ║
╠═════════════════╬════╬═════════╬═════════╬═════════╬═════════╣
║ Lucas A Shell   ║ 11 ║ 426.425 ║  16.076 ║ 408.044 ║ 456.926 ║
║ Chepner 1 Shell ║ 11 ║  39.582 ║   2.002 ║  37.404 ║  43.609 ║
║ Awk 256         ║ 11 ║  38.916 ║   2.925 ║  30.874 ║  41.737 ║
║ Chepner 2 Shell ║ 11 ║  16.033 ║   1.294 ║  14.685 ║  17.981 ║
║ Leffler Shell   ║ 11 ║  15.683 ║   0.809 ║  14.375 ║  16.561 ║
║ Python 7000     ║ 11 ║   7.052 ║   0.344 ║   6.358 ║   7.771 ║
║ Awk 7000        ║ 11 ║   6.403 ║   0.384 ║   5.498 ║   6.891 ║
║ Perl 7000       ║ 11 ║   1.138 ║   0.037 ║   1.073 ║   1.204 ║
╚═════════════════╩════╩═════════╩═════════╩═════════╩═════════╝
The original shell script is 2.5 orders of magnitude slower than Perl; Python and Awk have almost the same performance when there are enough file descriptors available (Python simply stops if there aren't enough file descriptors available; so does Perl). The shell script can be made about half as fast as Python or Awk.
The 7000 denotes the number of open files needed (ulimit -n 7000). This is because there are 26 * 10 * 26 = 6760 different 3-character starting codes in the generated data. If you have more patterns, you'll need more open file descriptors to gain the benefit of keeping them all open, or you will need to write a file descriptor caching algorithm somewhat like the one that GNU Awk must be using, with the consequential performance loss. Note that if the data were presented in sorted order, so that all the entries for each output file arrived in sequence, then you'd be able to tweak the algorithms so that only one output file was open at a time, as in the sketch below. The randomly generated data is not in sorted order, so it hits any caching algorithm hard.
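Here is such a sketch (not one of the timed variants): sort on column 2 first, then let a small Awk program close each output file as soon as its prefix has gone past, so only one is open at a time. It assumes the split_DB directory already exists, and of course the sort has its own cost on 500 million records:
sort -k2,2 generated.data |
${AWK:-awk} '{ s = substr($2, 1, 3)
               if (s != prev && prev != "") close("split_DB/" prev ".part")
               prev = s
               print >> ("split_DB/" s ".part")
             }'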
Scripts
Here are the various scripts tested during this exercise. These, and much of the supporting material, are available on GitHub in soq/src/so-4747-6170, though not all the code used is present there.
Lucas A Shell — aka opscript.sh
cat "$@" |
while read line
do
PREFIX=$(echo "$line" | cut -f2 | cut -c1-3)
echo -e "$line" >> split_DB/$PREFIX.part
done
This is a not-entirely useless use of cat (see UUoC — Useless Use of cat for the comparison). If no arguments are provided, it copies standard input to the while loop; if any arguments are provided, they're treated as file names and passed to cat, which copies their contents to the while loop. The original script had a hard-wired < file in it. There is no measurable performance cost to using cat here. A similar change was needed in Chepner's shell script.
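For example, with the generated test data (and an empty split_DB directory), either invocation works:
mkdir -p split_DB
sh opscript.sh generated.data      # file name argument
sh opscript.sh < generated.data    # or: read from standard input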
Chepner 1 Shell — aka chepner-1.sh
cat "${@}" |
while read -r line; do
read -r _ col2 _ <<< "$line"
prefix=${col2:0:3}
printf '%s\n' "$line" >> split_DB/$prefix.part
done
Chepner 2 Shell — aka chepner-2.sh
cat "${@}" |
while read -r line; do
prefix=${line:12:3}
printf '%s\n' "$line" >> split_DB/$prefix.part
done
Leffler Shell — aka jlscript.sh
sed 's/^[^ ]* \(...\)/\1 &/' "$@" |
while read key line
do
    echo "$line" >> split_DB/$key.part
done
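The sed command copies the first three characters of the second column to the front of each line, so that read key line can peel them off as the output-file key while $line keeps the original record. For example, on the first sample record:
echo 'E1E583ZUT9 E1E583ZUT9.9 422255 490991884' |
sed 's/^[^ ]* \(...\)/\1 &/'
# Output: E1E E1E583ZUT9 E1E583ZUT9.9 422255 490991884
The loop then reads E1E into key and the unmodified record into line, appending it to split_DB/E1E.part.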
Awk Script — aka awkscript.sh
exec ${AWK:-awk} '{s=substr($2,1,3); print >> "split_DB/" s ".part"}' "$@"
This wins hands down for compactness of script, and has decent performance when run with GNU Awk and with enough available file descriptors.
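For example, run it the way the test script below runs it (the /opt/gnu/bin/awk path is simply where GNU Awk lives on my machine):
ulimit -S -n 7000                  # room for all 6760 output files
mkdir -p split_DB
AWK=/opt/gnu/bin/awk sh awkscript.sh generated.data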
Python Script — aka pyscript.py
This is a Python 3 script, a mildly modified version of what Chepner posted.
import fileinput

output_files = {}

#with open(file) as fh:
#    for line in fh:
for line in fileinput.input():
    cols = line.strip().split()
    prefix = cols[1][0:3]
    # Cache the output file handles, so that each is opened only once.
    #outfh = output_files.setdefault(prefix, open("../split_DB/{}.part".format(prefix), "w"))
    outfh = output_files.setdefault(prefix, open("split_DB/{}.part".format(prefix), "w"))
    print(line, file=outfh, end='')    # line already ends with a newline

# Close all the output files
for f in output_files.values():
    f.close()
Perl Script — aka jlscript.pl
#!/usr/bin/env perl
use strict;
use warnings;

my %fh;

while (<>)
{
    my @fields = split;
    my $pfx = substr($fields[1], 0, 3);
    open $fh{$pfx}, '>>', "split_DB/${pfx}.part" or die
        unless defined $fh{$pfx};
    my $fh = $fh{$pfx};
    print $fh $_;
}

foreach my $h (keys %fh)
{
    close $fh{$h};
}
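Like the Awk 7000 variant, the Perl and Python scripts keep every output file open until the end of the run, so raise the open-files limit before running them; a typical invocation:
ulimit -S -n 7000
mkdir -p split_DB
perl jlscript.pl generated.data    # or: python3 pyscript.py generated.data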
Test Script — aka test-script.sh
#!/bin/bash
#
# Test suite for SO 4747-6170
set_num_files()
{
    nfiles=${1:-256}
    if [ "$(ulimit -n)" -ne "$nfiles" ]
    then if ulimit -S -n "$nfiles"
         then : OK
         else echo "Failed to set num files to $nfiles" >&2
              ulimit -HSa >&2
              exit 1
         fi
    fi
}
test_python_7000()
{
    set_num_files 7000
    timecmd -smr python3 pyscript.py "$@"
}

test_perl_7000()
{
    set_num_files 7000
    timecmd -smr perl jlscript.pl "$@"
}

test_awk_7000()
{
    set_num_files 7000
    AWK=/opt/gnu/bin/awk timecmd -smr sh awkscript.sh "$@"
}

test_awk_256()
{
    set_num_files 256    # Default setting on macOS 10.13.1 High Sierra
    AWK=/opt/gnu/bin/awk timecmd -smr sh awkscript-256.sh "$@"
}

test_op_shell()
{
    timecmd -smr sh opscript.sh "$@"
}

test_jl_shell()
{
    timecmd -smr sh jlscript.sh "$@"
}

test_chepner_1_shell()
{
    timecmd -smr bash chepner-1.sh "$@"
}

test_chepner_2_shell()
{
    timecmd -smr bash chepner-2.sh "$@"
}
shopt -s nullglob
# Setup - the test script reads 'file'.
# The SOQ global .gitignore doesn't permit 'file' to be committed.
rm -fr split_DB
rm -f file
ln -s generated.data file
# Ensure cleanup
trap 'rm -fr split_DB; exit 1' 0 1 2 3 13 15
for function in \
    test_awk_256 \
    test_awk_7000 \
    test_chepner_1_shell \
    test_chepner_2_shell \
    test_jl_shell \
    test_op_shell \
    test_perl_7000 \
    test_python_7000
do
    mkdir split_DB
    boxecho "${function#test_}"
    time $function file
    # Basic validation - the same information should appear for all scripts
    ls split_DB | wc -l
    wc split_DB/* | tail -n 2
    rm -fr split_DB
done
trap 0
This script was run using the command line notation:
time (ulimit -n 7000; TRACEDIR=. Trace bash test-script.sh)
The Trace command logs all standard output and standard error to a log file and echoes it to its own standard output, and it reports on the 'environment' in a broad sense (environment variables, ulimit settings, date, time, command, current directory, user/groups, etc.). It took just under 10 minutes to run the complete set of tests, three-quarters of which was spent running the OP's script.