Data generation
I created 10 files each with 300,000 lines, each line containing 11 characters (mainly digits) and a newline, using this Perl script:
#!/usr/bin/env perl
use strict;
use warnings;
# Generate sorted, merge-worthy data
die "Usage: num-files lines-per-file" unless scalar(@ARGV) == 2;
my $num_files = $ARGV[0];
my $num_lines = $ARGV[1];
die "Enter a number of files between 2 and 999" unless $num_files >= 2 && $num_files <= 999;
die "Enter a number of lines between 2 and 9,999,999" unless $num_files >= 2 && $num_files <= 9_999_999;
my $base = "datafile";
my $digits = length($num_files . '');
my $format = "$base.%.${digits}d";
foreach my $i (1..$num_files)
{
    my $file = sprintf $format, $i;
    open my $fh, '>', $file or die "Failed to create $file";
    my $offset = 60 * $i;
    foreach my $n (1..$num_lines)
    {
        printf $fh "%.7d-%.3d\n", $n + $offset, $i;
    }
    close $fh;
}
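Assuming the script is saved as datagen.pl (the name is mine; the script isn't named above), the data described here would be generated with:
$ perl datagen.pl 10 300000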
It took Perl 5.18.0 about 1.5 seconds to generate the 10 files (on a MacBook Pro, 2.3 GHz Intel Core i7, 16 GiB 1333 MHz RAM, good old-fashioned spinning magnetic disk; useful, but not incredibly powerful). The data is designed so that there is overlap between the files, but also so that each line is unique and identifies the file it comes from (that's the purpose of the $offset and of putting the sequential number within the file before the file number). In testing 3 files, say, there is some data from file 1 alone before data from files 1 and 2 interleaves, then data from all 3 files, then file 1 runs out, then file 2. That gives decent coverage of the different parts of the merge algorithm (but you could create more scenarios).
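To make that concrete (these lines are derived from the format string, not captured from a run): datafile.01 starts at 0000061-001 and datafile.02 at 0000121-002, so the merged output begins with 60 lines from file 1 alone before the two files interleave:
0000061-001   <- first line of datafile.01
0000062-001
...
0000121-001   <- line 61 of datafile.01
0000121-002   <- first line of datafile.02
0000122-001
...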
I then created a merge program, shown in the 'Code' section below. There's nothing very fancy about it: it lifts some heap-management algorithms from Jon Bentley's 'More Programming Pearls' (the originals are quoted as comments). The only tricky bits are the sense of the comparison routine (which caused me some wrong answers to start with) and the levels of indirection.
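For the record, the inversion that tripped me up: Bentley's pseudocode maintains a max-heap, but a merge needs the smallest line at the root (a min-heap), so the sense of each test is flipped when mapped onto strcmp() via compare():
Bentley, siftup (max-heap):   if (x[p] >= x[i]) break
This code (min-heap):         if (compare(inputs, heap[p], heap[i]) <= 0) break;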
Timing Results
When run on the 10 files representing ~35 MiB of data:
$ ls -l datafile.??
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.01
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.02
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.03
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.04
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.05
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.06
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.07
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.08
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.09
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 20:37 datafile.10
$
$ for file in datafile.??; do echo $file; sort -c $file; done
datafile.01
datafile.02
datafile.03
datafile.04
datafile.05
datafile.06
datafile.07
datafile.08
datafile.09
datafile.10
$ time ./merge datafile.?? > datafile.out
real 0m0.510s
user 0m0.314s
sys 0m0.074s
$ time ./merge datafile.?? > datafile.out
real 0m0.513s
user 0m0.317s
sys 0m0.080s
$ time sort -m datafile.?? > datafile.merge
real 0m10.061s
user 0m9.960s
sys 0m0.083s
$ time sort -m datafile.?? > datafile.merge
real 0m10.039s
user 0m9.921s
sys 0m0.082s
$ ls -l datafile.*
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.01
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.02
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.03
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.04
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.05
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.06
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.07
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.08
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.09
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 20:37 datafile.10
-rw-r--r-- 1 jleffler staff 36000000 Sep 15 20:42 datafile.merge
-rw-r--r-- 1 jleffler staff 36000000 Sep 15 20:41 datafile.out
$ cmp datafile.out datafile.merge
$ sort -c datafile.out
$
Interpreting these results:
- The first ls listing shows 10 data files.
- The for loop checks that the input files are individually in sorted order.
- The merge program I wrote runs in about 0.5s.
- The system sort in merge mode (-m) runs in about 10s (and I'm astonished that it takes so long when my non-optimized code handles it so much quicker).
- The second ls listing shows that the output files are the same size.
- The cmp command shows that the output files are identical.
- The sort -c command shows that the output file is in sorted order.
I separately and subsequently created 101 files, each with 1,000,000 records of 12 bytes each, in 52 seconds. I merged the files in 20 seconds, generating approximately 1179 MiB (1,236,000,000 bytes) of output. The system sort merged them in 467 seconds (7m47s, aka 'forever'). It took sort -c about 90 seconds to check that the output file was in sorted order, and cmp less than 2 seconds to compare the two large files. I'm intrigued that the system sort is so slow at merging.
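One plausible contributor, which I haven't verified, is locale-aware collation: sort typically compares lines with strcoll() unless the locale is C, whereas my code uses plain strcmp(). Re-running the merge as
$ time LC_ALL=C sort -m datafile.?? > datafile.merge
would show how much of the gap that accounts for.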
Interpretation of the results
- On my Mac, it is possible to merge ~35 MiB of data from 10 input files in about half a second.
- The system sort can do the job in about ten seconds.
- By inference (given that my machine is no longer a rocket, if it ever was), it should be possible to merge ~35 MiB on a Windows machine in under twenty seconds (allowing you a factor of 40 disparity in performance, which I don't think you should need).
- So, your code taking thirty seconds is far too long, and the question is 'why'. You should look hard at your algorithms and data structures.
Code
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct File
{
    const char *file;       /* Name of the input file */
    FILE       *fp;         /* Open stream (null once exhausted) */
    char       *line;       /* Current line (null once exhausted) */
    size_t      reserve;    /* Space allocated for line */
    size_t      length;     /* Space used by current line */
} File;
extern void err_exit(const char *fmt, ...);
extern void read_line(File *current);
extern void write_line(File *current);
extern void heapify(size_t *heap, size_t heap_size, File *inputs);
extern void siftup(size_t *heap, size_t lo, size_t hi, File *inputs);
extern void siftdown(size_t *heap, size_t lo, size_t hi, File *inputs);
extern int compare(File *inputs, size_t i1, size_t i2);
const char *arg0;
int main(int argc, char **argv)
{
    File inputs[argc];
    arg0 = argv[0];

    /* Open files for reading - load first lines */
    for (int i = 1; i < argc; i++)
    {
        File *current = &inputs[i];
        current->file = argv[i];
        if ((current->fp = fopen(current->file, "r")) == 0)
            err_exit("failed to open file %s for reading", current->file);
        current->reserve = 128;
        if ((current->line = malloc(current->reserve)) == 0)
            err_exit("failed to allocate %zu bytes memory", current->reserve);
        current->length = 0;
        read_line(current);
    }

#if defined(CHECK_FUNDAMENTALS)
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s\n", i, inputs[i].length, inputs[i].file);
#endif

    size_t heap_size = argc - 1;
    size_t heap[argc];  // heap[0] unused
    for (int i = 1; i < argc; i++)
        heap[i] = i;

#if defined(CHECK_FUNDAMENTALS)
    printf("Heap before:\n");
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif

    heapify(heap, heap_size, inputs);

#if defined(CHECK_FUNDAMENTALS)
    printf("Heap after:\n");
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif

#if defined(CHECK_FUNDAMENTALS)
    printf("Compare two lines:\n");
    printf("1: %s\n", inputs[1].line);
    printf("2: %s\n", inputs[2].line);
    int r12 = compare(inputs, 1, 2);
    int r21 = compare(inputs, 2, 1);
    int r11 = compare(inputs, 1, 1);
    printf("1 vs 2: %d\n", r12);
    printf("2 vs 1: %d\n", r21);
    printf("1 vs 1: %d\n", r11);
#endif

    /* Main merge loop: the smallest current line is always at heap[1] */
    while (heap_size > 0)
    {
        File *current = &inputs[heap[1]];
        write_line(current);
        read_line(current);
        if (current->line == 0)     /* File exhausted - remove it from the heap */
            heap[1] = heap[heap_size--];
        if (heap_size > 0)
        {
            siftdown(heap, 1, heap_size, inputs);
#if defined(CHECK_FUNDAMENTALS)
            printf("Heap check:\n");
            for (int i = 1; i < argc; i++)
                printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif
        }
    }

    return 0;
}
int compare(File *inputs, size_t i1, size_t i2)
{
    return strcmp(inputs[i1].line, inputs[i2].line);
}

void heapify(size_t *heap, size_t heap_size, File *inputs)
{
    for (size_t i = 1; i <= heap_size; i++)
        siftup(heap, 1, i, inputs);
}
/* J Bentley: More Programming Pearls
**
** Heap in array x, indexed from 1, not 0 as is normal in C.
** Implementation will allocate but not use array[0].
**
** function siftup(l, u, i, p) {
**     # pre maxheap(l, u-1)
**     # post maxheap(l, u)
**     i = u
**     while (1) {
**         # maxheap(l, u) except between i and its parent
**         if (i <= l) break
**         p = int(i/2)
**         if (x[p] >= x[i]) break
**         swap(p, i)
**         i = p
**     }
** }
**
** function siftdown(l, u, i, c) {
**     # pre maxheap(l+1, u)
**     # post maxheap(l, u)
**     i = l
**     while (1) {
**         # maxheap(l, u) except between i and its children
**         c = 2*i
**         if (c > u) break
**         if (c + 1 <= u && x[c+1] > x[c]) c++
**         if (x[i] >= x[c]) break
**         swap(c, i)
**         i = c
**     }
** }
*/
void siftup(size_t *heap, size_t lo, size_t hi, File *inputs)
{
    size_t i = hi;
    while (1)
    {
        if (i <= lo)
            break;
        size_t p = i / 2;
        if (compare(inputs, heap[p], heap[i]) <= 0)
            break;
        size_t t = heap[p];
        heap[p] = heap[i];
        heap[i] = t;
        i = p;
    }
}
void siftdown(size_t *heap, size_t lo, size_t hi, File *inputs)
{
    size_t i = lo;
    while (1)
    {
        size_t c = 2 * i;
        if (c > hi)
            break;
        if (c + 1 <= hi && compare(inputs, heap[c+1], heap[c]) < 0)
            c++;
        if (compare(inputs, heap[i], heap[c]) <= 0)
            break;
        size_t t = heap[c];
        heap[c] = heap[i];
        heap[i] = t;
        i = c;
    }
}
void read_line(File *current)
{
    char buffer[4096];
    if (fgets(buffer, sizeof(buffer), current->fp) != 0)
    {
        size_t length = strlen(buffer) + 1;
        if (length > current->reserve)
        {
            void *space = realloc(current->line, length);
            if (space == 0)
                err_exit("failed to reallocate %zu bytes memory", length);
            current->line = space;
            current->reserve = length;
        }
        strcpy(current->line, buffer);
        current->length = length - 1;
    }
    else
    {
        fclose(current->fp);
        current->fp = 0;
        free(current->line);
        current->line = 0;
        current->length = 0;
        current->reserve = 0;
    }
}
void write_line(File *current)
{
    if (fwrite(current->line, sizeof(char), current->length, stdout) != current->length)
        err_exit("short write of line from %s of length %zu", current->file, current->length);
}
void err_exit(const char *fmt, ...)
{
    int errnum = errno;
    va_list args;
    fprintf(stderr, "%s: ", arg0);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    if (errnum != 0)
        fprintf(stderr, " (%d: %s)", errnum, strerror(errnum));
    putc('\n', stderr);
    exit(EXIT_FAILURE);
}
The code is designed to handle lines up to 4 KiB long; it would not be hard to revise it to use POSIX getline() to handle even longer lines. It assumes that all the files can be open at once (which means an upper limit of around 250 input files on most machines without tweaking limits). It stops if it can't open all the files, rather than doing multiple merges to intermediate files.
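For instance, here's a minimal, untested sketch of read_line() rewritten around getline(); the line, reserve, and fp members map directly onto getline()'s arguments, and getline() grows the buffer itself via realloc():

void read_line(File *current)
{
    /* getline() returns the number of bytes read (including the newline,
    ** excluding the terminating NUL), or -1 on EOF/error - matching the
    ** length semantics of the fgets() version above.
    */
    ssize_t nread = getline(&current->line, &current->reserve, current->fp);
    if (nread != -1)
        current->length = (size_t)nread;
    else
    {
        fclose(current->fp);
        current->fp = 0;
        free(current->line);
        current->line = 0;
        current->length = 0;
        current->reserve = 0;
    }
}

Note that ssize_t comes from <sys/types.h>, and getline() may need a feature-test macro such as _POSIX_C_SOURCE on some systems.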