3

I have a really huge file (>500 million lines) that I want to split into several smaller files according to the first 3 characters of one of its columns.

It looks like this, where each element of columns 1 and 2 is unique:

A0A023GPI8  A0A023GPI8.1    232300  1027923628
A0A023GPJ0  A0A023GPJ0.2    716541  765680613
A0A023PXA5  A0A023PXA5.1    559292  728048729
A0A023PXB0  A0A023PXB0.1    559292  728048786
A0A023PXB5  A0A023PXB5.1    559292  728048524
A0A023PXB9  A0A023PXB9.1    559292  728048769
A0A023PXC2  A0A023PXC2.1    559292  728050382

I used the following script thinking it would be quite fast, because it seemed to me that it involved only a single pass over the whole file. However, it has been running for several days and is far from finished. Can anyone explain why, and suggest a faster approach?

while read line
do
    PREFIX=$(echo "$line" | cut -f2 | cut -c1-3)
    echo -e "$line" >> ../split_DB/$PREFIX.part
done < $file
asked by Lucas A
    "fast" and "bash" are mutually incompatible concepts. Awk would probably be several orders of magnitude faster, although it depends on how many output files you end up with. – rici Nov 24 '17 at 15:39
  • How many unique prefixes are you expecting? It might be more efficient to pre-open the output files instead of reopening the same file over and over again. – chepner Nov 24 '17 at 15:39
  • Bash is my language of choice for practically everything *except* cases where there will be huge/complex memory structures or bit-fiddly math or streaming of millions of lines, though it *can* do those things if you really need that. I'd use (perl|your other preference). – Paul Hodges Nov 24 '17 at 15:44
  • Have you considered using [split](https://linux.die.net/man/1/split)? Just write a regex to capture the portion of the column that you want to use as the split mark. – dawg Nov 24 '17 at 15:48
  • The reason it is slow is that for each of the 500 million lines, you are forcing your shell to create 3 processes, so your kernel is hard at work spawning 1.5 billion processes. Suppose it can handle 10 thousand processes a second; you’re still looking at 150 thousand seconds, which is 2 days. And 10k processes per second is fast; probably a factor of ten or more better than you’re getting. – Jonathan Leffler Nov 24 '17 at 16:10

3 Answers

3

read isn't terribly efficient; it has to read one character at a time to avoid reading past the next newline character. However, a bigger source of overhead here is spawning a subshell and two cut processes for every line. We can avoid that by using read again to do the splitting, and using parameter expansion to extract the first three characters of the second column.

while read -r line; do
    read -r _ col2 _ <<< "$line"
    prefix=${col2:0:3}
    # If the first column has a fixed width, you can forgo the
    # previous two lines and use
    #   prefix=${line:12:3}
    printf '%s\n' "$line" >> ../split_DB/$prefix.part
done < "$file"

However, I wouldn't spend too much time trying to do this efficiently in bash: here's a quick-and-dirty Python script that will do the same thing:

import functools

# Cache the output file handles, so that each is opened only once.
@functools.lru_cache(None)
def open_output_file(path):
    return open(path, 'a')

# 'file' should hold the path to the big input file
with open(file) as fh:
    for line in fh:
        cols = line.strip().split()
        prefix = cols[1][0:3]
        open_output_file(f"../split_DB/{prefix}.part").write(line)
answered by chepner
  • His index is the first three characters of the second column. You have written both of these to take only one character. Otherwise, this is great. – dawg Nov 24 '17 at 15:52
  • I was afraid that with a Python script all the "subfiles" would remain open until the end of the process, so the 17 GB of the main file would be in memory. However, I will test this solution and compare its speed. – Lucas A Nov 24 '17 at 15:54
  • @dawg Ah, right. I think the paste here is converting tabs to spaces, which made me think the `cut -c1-3` was to remove the two spaces *before* the second column. Updated both scripts to handle a 3-character prefix. – chepner Nov 24 '17 at 15:54
  • @LucasA Python isn't reading the entire main file into memory; it only reads one line at a time. (Or rather, it fills some fixed-size buffer with multiple lines, but the memory usage is constant in any case.) Likewise, the subfiles are all open but not consuming any memory beyond whatever fixed-size write buffer they use. – chepner Nov 24 '17 at 15:55
3

It is potentially as easy as:

$ awk '{s=substr($2,1,3); print >> s}' file

The >> redirects the print so that it appends to a file with the given name. The name is formed from the first 3 characters of the second column.

This will be monumentally faster than Bash dealing with this file.
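
If you want the output to land in ../split_DB with a .part suffix, as in the original script, the same one-liner can build that path (GNU Awk and enough file descriptors assumed):

$ awk '{s=substr($2,1,3); print >> ("../split_DB/" s ".part")}' file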


Note:

Usually an OS has a limit on the number of simultaneously open files. This may be an issue depending on the number of potential character combinations in the first 3 characters of the second column. It will affect any solution where the files with those names remain open while processing the given file -- not just awk.

If you have 000 to 999, that is 1,000 potential files open; if you have AAA to ZZZ, that is 17,576; if you have three alphanumeric characters with upper and lower case, that is 238,328 potential open files... If your data has only a few unique prefixes, you may not need to worry about this; if you state the details of the data, the solutions suggested here may be different.

(You can calculate the number of potential combinations as the size of the allowed alphabet raised to the power 3: ('0'..'9','A'..'Z') is a 36-character alphabet, giving 36^3 = 46,656; ('0'..'9','a'..'z','A'..'Z') is 62 characters, giving 62^3 = 238,328; and so on.)

You can raise the limit on most Unix-style OSes if need be (within reason), or open and close files as needed. Raising the file limit to 238,328 would be impractical. You could also sort the data and close the previous output file as it goes out of use, as sketched below.
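
A rough, untested sketch of that last approach (sort on the second column first, then keep at most one output file open) might look like this; it assumes split_DB already exists:

sort -k2,2 file |
awk '{
    s = substr($2, 1, 3)
    if (s != prev) {              # prefix changed: the previous file is finished
        if (prev != "") close(out)
        prev = s
        out = "split_DB/" s ".part"
    }
    print >> out
}'

Sorting 500 million lines is itself expensive, of course, but it trades the open-file problem for a problem that sort is designed to handle.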

answered by dawg
  • Simplest solution, nice. Are we sure that files are kept open after performing a "print >> afile"? If not, it could be slow to re-append many, many times to the same file. So perhaps it would be better to read the source file several times, but write each split file completely. – linuxfan says Reinstate Monica Nov 24 '17 at 19:01
  • @linuxfan: gawk keeps the files open as long as there are free file descriptor slots. Also, see http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_14 – rici Nov 24 '17 at 19:46
  • Yes, the files stay open unless explicitly closed with `awk` – dawg Nov 24 '17 at 20:10
2

Why the shell script is slow

The reason it is slow is that for each of the 500 million lines, you are forcing your shell to create 3 processes, so your kernel is hard at work spawning 1.5 billion processes. Suppose it can handle 10 thousand processes a second; you’re still looking at 150 thousand seconds, which is 2 days. And 10k processes per second is fast; probably a factor of ten or more better than you’re getting. On my 2016 15" MacBook Pro running macOS High Sierra 10.13.1, with a 2.7 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3, and 500 GB Flash storage (about 150 GB free), I was getting around 700 processes per second, so the script would nominally take almost 25 days to run through 500 million records.
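
If you want to see that per-line process cost on your own machine, a rough micro-benchmark over the first 10,000 lines (assuming $file names the big file) shows the difference:

# One subshell plus two cut processes per line (the 3 processes mentioned above):
time head -n 10000 "$file" | while read -r line; do
    prefix=$(echo "$line" | cut -f2 | cut -c1-3)
done

# Parameter expansion only; no extra processes per line:
time head -n 10000 "$file" | while read -r line; do
    prefix=${line:12:3}
done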

Ways to speed it up

There are ways to make the code faster. You could use plain shell, or Awk, or Python, or Perl. Note that if you use Awk, it needs to be GNU Awk, or at least not BSD (macOS) Awk — the BSD version simply decided it hadn't got enough file descriptors.

I used a random data generator to create a file with 100,000 random entries somewhat similar to those in the question:

E1E583ZUT9  E1E583ZUT9.9    422255  490991884
Z0L339XJB5  Z0L339XJB5.0    852089  601069716
B3U993YMV8  B3U993YMV8.7    257653  443396409
L2F129EXJ4  L2F129EXJ4.8    942989  834728260
R4G123QWR2  R4G123QWR2.6    552467  744905170
K4Z576RKP0  K4Z576RKP0.9    947374  962234282
Z4R862HWX1  Z4R862HWX1.4    909520  2474569
L5D027SCJ5  L5D027SCJ5.4    199652  773936243
R5R272YFB5  R5R272YFB5.4    329247  582852318
G1I128BMI2  G1I128BMI2.6    359124  404495594

(The command used is a home-brew generator that's about to get a rewrite.) The first two columns have the same 10 leading characters in the pattern X#X###XXX# (X for letter, # for digit); the only difference is in the suffix .#. This isn't exploited in the scripts; it doesn't matter in the slightest. There's also no guarantee that the values in the second column are unique, nor that the .1 entry appears for a key if the .2 entry appears, etc. These details are mostly immaterial to the performance measurements. Because of the letter-digit-letter prefix used for the file names, there are 26 * 10 * 26 = 6,760 possible file prefixes. With the 100,000 randomly generated records, every one of those prefixes is present.
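
The real generator isn't included here; purely as an illustration (the two-space separator and the widths of the last two columns are guesses), a rough awk stand-in that emits rows of about the same shape might be:

awk -v n=100000 'BEGIN {
    srand()
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    pat = "X#X###XXX#"          # X = letter, # = digit
    for (i = 1; i <= n; i++) {
        key = ""
        for (j = 1; j <= length(pat); j++) {
            if (substr(pat, j, 1) == "X")
                key = key substr(letters, int(26 * rand()) + 1, 1)
            else
                key = key int(10 * rand())
        }
        printf "%s  %s.%d  %d  %d\n", key, key, int(10 * rand()), 100000 + int(900000 * rand()), int(1000000000 * rand())
    }
}' > generated.data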

I wrote a script to time various ways of processing the data. There are 4 shell script variants — the one posted by Lucas A, the OP; two posted by chepner (one as comments), and one I created. There's also the Awk script created by dawg, a mildly modified version of the Python 3 script posted by chepner, and a Perl script I wrote.

Results

The results can be summarized by this table (run-time measured in seconds of elapsed time or wall clock time):

╔═════════════════╦════╦═════════╦═════════╦═════════╦═════════╗
║  Script Variant ║  N ║    Mean ║ Std Dev ║     Min ║     Max ║
╠═════════════════╬════╬═════════╬═════════╬═════════╬═════════╣
║   Lucas A Shell ║ 11 ║ 426.425 ║  16.076 ║ 408.044 ║ 456.926 ║
║ Chepner 1 Shell ║ 11 ║  39.582 ║   2.002 ║  37.404 ║  43.609 ║
║         Awk 256 ║ 11 ║  38.916 ║   2.925 ║  30.874 ║  41.737 ║
║ Chepner 2 Shell ║ 11 ║  16.033 ║   1.294 ║  14.685 ║  17.981 ║
║   Leffler Shell ║ 11 ║  15.683 ║   0.809 ║  14.375 ║  16.561 ║
║     Python 7000 ║ 11 ║   7.052 ║   0.344 ║   6.358 ║   7.771 ║
║        Awk 7000 ║ 11 ║   6.403 ║   0.384 ║   5.498 ║   6.891 ║
║       Perl 7000 ║ 11 ║   1.138 ║   0.037 ║   1.073 ║   1.204 ║
╚═════════════════╩════╩═════════╩═════════╩═════════╩═════════╝

The original shell script is 2.5 orders of magnitude slower than Perl; Python and Awk have almost the same performance when there are enough file descriptors available (Python simply stops if there aren't enough file descriptors available; so does Perl). The shell script can be made about half as fast as Python or Awk.

The 7000 denotes the number of open files needed (ulimit -n 7000). This is because there are 26 * 10 * 26 = 6760 different 3-character starting codes in the generated data. If you have more prefixes, you'll need more open file descriptors to gain the benefit of keeping them all open, or you will need to write a file-descriptor caching algorithm somewhat like the one that GNU Awk must be using, with the consequent performance loss. Note that if the data were presented in sorted order, so that all the entries for each file appeared in sequence, then you'd be able to tweak the algorithms so that only one output file was open at a time. The randomly generated data is not in sorted order, so it hits any caching algorithm hard.
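
Before settling on a limit for your own data, it may be worth counting the distinct 3-character prefixes that are actually present and checking the current soft limit; for example (assuming the key is in column 2):

awk '{ seen[substr($2, 1, 3)]++ } END { n = 0; for (k in seen) n++; print n }' file
ulimit -n              # current soft limit on open files
ulimit -S -n 7000      # raise it for this shell, up to the hard limit (ulimit -H -n)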

Scripts

Here are the various scripts tested during this exercise. These, and much of the supporting material, are available on GitHub in soq/src/so-4747-6170. Not all the code used is present on GitHub.

Lucas A Shell — aka opscript.sh

cat "$@" |
while read line
do
    PREFIX=$(echo "$line" | cut -f2 | cut -c1-3)
    echo -e "$line" >> split_DB/$PREFIX.part
done

This is a not-entirely useless use of cat (see UUoC — Useless Use of cat for the comparison). If no arguments are provided, it copies standard input to the while loop; if any arguments are provided, they're treated as file names and passed to cat, which copies the contents of those files to the while loop. The original script had a hard-wired `< $file` redirection in it. There is no measurable performance cost to using cat here. A similar change was needed in Chepner's shell scripts.

Chepner 1 Shell — aka chepner-1.sh

cat "${@}" |
while read -r line; do
    read -r _ col2 _ <<< "$line"
    prefix=${col2:0:3}
    printf '%s\n' "$line" >> split_DB/$prefix.part
done

Chepner 2 Shell — aka chepner-2.sh

cat "${@}" |
while read -r line; do
    prefix=${line:12:3}
    printf '%s\n' "$line" >> split_DB/$prefix.part
done

Leffler Shell — aka jlscript.sh

# Prepend the first 3 characters of the second column as an extra leading
# field; 'read key line' then splits that prefix off again.
sed 's/^[^ ]*  \(...\)/\1 &/' "$@" |
while read key line
do
    echo "$line" >> split_DB/$key.part
done

Awk Script - aka awkscript.sh

exec ${AWK:-awk} '{s=substr($2,1,3); print >> "split_DB/" s ".part"}' "$@"

This wins hands down for compactness of script, and has decent performance when run with GNU Awk and with enough available file descriptors.

Python Script — aka pyscript.py

This is a Python 3 script, a mildly modified version of what Chepner posted.

import fileinput

output_files = {}
#with open(file) as fh:
#    for line in fh:
for line in fileinput.input():
    cols = line.strip().split()
    prefix = cols[1][0:3]
    # Cache the output file handles, so that each is opened only once.
    #outfh = output_files.setdefault(prefix, open("../split_DB/{}.part".format(prefix), "w"))
    outfh = output_files.setdefault(prefix, open("split_DB/{}.part".format(prefix), "w"))
    print(line, file=outfh)

# Close all the output files
for f in output_files.values():
    f.close()

Perl Script — aka jlscript.pl

#!/usr/bin/env perl
use strict;
use warnings;

my %fh;

while (<>)
{
    my @fields = split;
    my $pfx = substr($fields[1], 0, 3);
    open $fh{$pfx}, '>>', "split_DB/${pfx}.part" or die
        unless defined $fh{$pfx};
    my $fh = $fh{$pfx};
    print $fh $_;
}

foreach my $h (keys %fh)
{
    close $fh{$h};
}

Test Script — aka test-script.sh

#!/bin/bash
#
# Test suite for SO 4747-6170

set_num_files()
{
    nfiles=${1:-256}
    if [ "$(ulimit -n)" -ne "$nfiles" ]
    then if ulimit -S -n "$nfiles" 
         then : OK
         else echo "Failed to set num files to $nfiles" >&2
              ulimit -HSa >&2
              exit 1
         fi
     fi
}

test_python_7000()
{
    set_num_files 7000
    timecmd -smr python3 pyscript.py "$@"
}

test_perl_7000()
{
    set_num_files 7000
    timecmd -smr perl jlscript.pl "$@"
}

test_awk_7000()
{
    set_num_files 7000
    AWK=/opt/gnu/bin/awk timecmd -smr sh awkscript.sh "$@"
}

test_awk_256()
{
    set_num_files 256   # Default setting on macOS 10.13.1 High Sierra
    AWK=/opt/gnu/bin/awk timecmd -smr sh awkscript-256.sh "$@"
}

test_op_shell()
{
    timecmd -smr sh opscript.sh "$@"
}

test_jl_shell()
{
    timecmd -smr sh jlscript.sh "$@"
}

test_chepner_1_shell()
{
    timecmd -smr bash chepner-1.sh "$@"
}

test_chepner_2_shell()
{
    timecmd -smr bash chepner-2.sh "$@"
}

shopt -s nullglob

# Setup - the test script reads 'file'.
# The SOQ global .gitignore doesn't permit 'file' to be committed.
rm -fr split_DB
rm -f file
ln -s generated.data file

# Ensure cleanup
trap 'rm -fr split_DB; exit 1' 0 1 2 3 13 15

for function in \
    test_awk_256 \
    test_awk_7000 \
    test_chepner_1_shell \
    test_chepner_2_shell \
    test_jl_shell \
    test_op_shell \
    test_perl_7000 \
    test_python_7000
do
    mkdir split_DB
    boxecho "${function#test_}"
    time $function file
    # Basic validation - the same information should appear for all scripts
    ls split_DB | wc -l
    wc split_DB/* | tail -n 2
    rm -fr split_DB
done

trap 0

This script was run using the command line notation:

time (ulimit -n 7000; TRACEDIR=. Trace bash test-script.sh)

The Trace command logs all standard output and standard error to a log file and echoes it to its own standard output, and it reports on the 'environment' in a broad sense (environment variables, ulimit settings, date, time, command, current directory, user/groups, etc.). It took just under 10 minutes to run the complete set of tests, three-quarters of which was spent running the OP's script.
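
Trace, timecmd and boxecho are home-brew utilities rather than standard commands; if you don't have them, a rough approximation is to substitute plain time and echo for them inside the test script and capture the output with tee, along the lines of:

time (ulimit -n 7000; bash test-script.sh 2>&1 | tee test-script.log)

The per-test timing detail that timecmd -smr records is lost that way, but the overall comparison still works.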

answered by Jonathan Leffler
  • @MarkSetchell: Yes. I have updated the answer to note that explicitly, and stated that it is wall-clock time, not CPU time. Thanks! – Jonathan Leffler Nov 26 '17 at 10:51
  • Wow, great job. I tested the Perl solution and it does the job in a couple of hours, while my bash script has been running since last Friday... Thanks for the scripts and, moreover, thanks a lot for the explanations – Lucas A Nov 27 '17 at 10:58
  • @LucasA: glad you find it helpful. Perl is often the quickest alternative short of a custom C program. Note that if the data is presented in sorted order rather than completely at random as with the sample data I tested with, you don’t need exorbitant numbers of file descriptors. Adding sort into the pipeline didn’t change the overall running time by much, but did mean that it was no longer necessary to futz with the number of file descriptors. However, I was playing with 1/5000 scale files that could all fit in memory for sorting. At full scale, sorting probably uses disks, which is slower. – Jonathan Leffler Nov 27 '17 at 14:57
  • If the data is generated in sorted order, you should exploit that. – Jonathan Leffler Nov 27 '17 at 14:58
  • What a great answer. Thanks. One comment though: If I do the math, three digits of `('0'..'9','A'..'Z')` is essentially a Base36 decoding of `'ZZZ'`, which is 46,655. If you extend that alphabet to three digits of `('0'..'9','a'..'z','A'..'Z')`, it is 238,327 potential strings. (Just do a Base62 conversion of `'ZZZ'`.) So I think the estimate of 7,000 potential file handles is low. – dawg Nov 27 '17 at 18:45
  • @dawg: Thanks. Given that the data only shows 7 rows and those only differ in the last 4 places of the 10 characters in the first and second columns, it is hard to know what the rules are. As I explained, I assumed alpha-digit-alpha giving the calculation 26 x 10 x 26. If the coding rules are different, the calculation is different. And as I also mentioned, if the data is sorted so that rows going to a file are presented consecutively, then the file descriptor limit ceases to be an issue. I don't like the 7000 number, but I (we) lack the information to do better. It could be sorted already. – Jonathan Leffler Nov 27 '17 at 18:55
  • Agreed. Details matter. – dawg Nov 27 '17 at 18:56
  • @JonathanLeffler : run it with `mawk` - preferably `mawk2`, instead of `gawk`, and there's a chance to be even faster than `perl` – RARE Kpop Manifesto Jun 29 '22 at 14:33