Data generation
I created 10 files each with 300,000 lines, each line containing 11 characters (mainly digits) and a newline, using this Perl script:
#!/usr/bin/env perl
use strict;
use warnings;
# Generate sorted, merge-worthy data
die "Usage: num-files lines-per-file" unless scalar(@ARGV) == 2;
my $num_files = $ARGV[0];
my $num_lines = $ARGV[1];
die "Enter a number of files between 2 and 999" unless $num_files >= 2 && $num_files <= 999;
die "Enter a number of lines between 2 and 9,999,999" unless $num_files >= 2 && $num_files <= 9_999_999;
my $base = "datafile";
my $digits = length($num_files . '');
my $format = "$base.%.${digits}d";
foreach my $i (1..$num_files)
{
    my $file = sprintf $format, $i;
    open my $fh, '>', $file or die "Failed to create $file";
    my $offset = 60 * $i;
    foreach my $n (1..$num_lines)
    {
        printf $fh "%.7d-%.3d\n", $n + $offset, $i;
    }
    close $fh;
}
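Assuming the script is saved as datagen.pl (the name is mine; the script isn't named above), the data described here would be generated with:
$ perl datagen.pl 10 300000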
It took Perl 5.18.0 about 1.5 seconds to generate the 10 files (on a MacBook Pro, 2.3 GHz Intel Core i7, 16 GiB 1333 MHz RAM, good old-fashioned spinning magnetic disk; useful, but not incredibly powerful). The data is designed so that there is overlap between the files, but also so that each line is unique and identifies the file it comes from (that's the purpose of the $offset and of putting the sequential number within the file before the file number). In testing 3 files, say, there is some data from file 1 alone before data from files 1 and 2 interleaves, then data from all 3 files, then file 1 runs out, then file 2. That gives decent coverage of the different parts of the merge algorithm (but you could create more scenarios).
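To make that concrete (these lines are derived from the format string, not captured from a run): datafile.01 starts at 0000061-001 and datafile.02 at 0000121-002, so the merged output begins with 60 lines from file 1 alone before the two files interleave:
0000061-001   <- first line of datafile.01
0000062-001
...
0000121-001   <- line 61 of datafile.01
0000121-002   <- first line of datafile.02
0000122-001
...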
I then created a merge program, shown in the 'Code' section below. There's nothing very fancy about it: it lifts some heap-management algorithms from Jon Bentley's 'More Programming Pearls' (the originals are quoted as comments). The only tricky bits are the sense of the comparison routine (which caused me some wrong answers to start with) and the levels of indirection.
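For the record, the inversion that tripped me up: Bentley's pseudocode maintains a max-heap, but a merge needs the smallest line at the root (a min-heap), so the sense of each test is flipped when mapped onto strcmp() via compare():
Bentley, siftup (max-heap):   if (x[p] >= x[i]) break
This code (min-heap):         if (compare(inputs, heap[p], heap[i]) <= 0) break;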
Timing Results
When run on the 10 files representing ~35 MiB of data:
$ ls -l datafile.??
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.01
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.02
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.03
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.04
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.05
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.06
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.07
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.08
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.09
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 20:37 datafile.10
$
$ for file in datafile.??; do echo $file; sort -c $file; done
datafile.01
datafile.02
datafile.03
datafile.04
datafile.05
datafile.06
datafile.07
datafile.08
datafile.09
datafile.10
$ time ./merge datafile.?? > datafile.out
real 0m0.510s
user 0m0.314s
sys 0m0.074s
$ time ./merge datafile.?? > datafile.out
real 0m0.513s
user 0m0.317s
sys 0m0.080s
$ time sort -m datafile.?? > datafile.merge
real 0m10.061s
user 0m9.960s
sys 0m0.083s
$ time sort -m datafile.?? > datafile.merge
real 0m10.039s
user 0m9.921s
sys 0m0.082s
$ ls -l datafile.*
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.01
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.02
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.03
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.04
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.05
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.06
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.07
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.08
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.09
-rw-r--r-- 1 jleffler staff 3600000 Sep 15 20:37 datafile.10
-rw-r--r-- 1 jleffler staff 36000000 Sep 15 20:42 datafile.merge
-rw-r--r-- 1 jleffler staff 36000000 Sep 15 20:41 datafile.out
$ cmp datafile.out datafile.merge
$ sort -c datafile.out
$
Interpreting these results:
- The first ls listing shows 10 data files.
- The for loop checks that the input files are individually in sorted order.
- The merge program I wrote runs in about 0.5s.
- The system sort in merge mode (-m) runs in about 10s (and I'm astonished that it takes so long when my non-optimized code handles it so much quicker).
- The second ls listing shows that the output files are the same size.
- The cmp command shows that the output files are identical.
- The sort -c command shows that the output file is in sorted order.
I separately and subsequently created 101 files, each with 1,000,000 records of 12 bytes each, in 52 seconds. I merged the files in 20 seconds, generating approximately 1179 MiB (1,236,000,000 bytes) of output. The system sort merged them in 467 seconds (7m47s, aka 'forever'). It took sort -c about 90 seconds to check that the output file was in sorted order, and cmp less than 2 seconds to compare the two large files. I'm intrigued that the system sort is so slow at merging.
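One plausible contributor, which I haven't verified, is locale-aware collation: sort typically compares lines with strcoll() unless the locale is C, whereas my code uses plain strcmp(). Re-running the merge as
$ time LC_ALL=C sort -m datafile.?? > datafile.merge
would show how much of the gap that accounts for.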
Interpretation of the results
- On my Mac, it is possible to merge ~35 MiB of data from 10 input files in about half a second.
- The system sort can do the job in about ten seconds.
- By inference (given that my machine is no longer a rocket, if it ever was), it should be possible to merge ~35 MiB on a Windows machine in under twenty seconds (allowing you a factor of 40 disparity in performance, which I don't think you should need).
- So, your code taking thirty seconds is far too long, and the question is 'why'. You should look hard at your algorithms and data structures.
Code
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct File
{
    const char *file;       /* Name of the input file */
    FILE       *fp;         /* Open stream (null once exhausted) */
    char       *line;       /* Current line (null once exhausted) */
    size_t      reserve;    /* Space allocated for line */
    size_t      length;     /* Space used by current line */
} File;
extern void err_exit(const char *fmt, ...);
extern void read_line(File *current);
extern void write_line(File *current);
extern void heapify(size_t *heap, size_t heap_size, File *inputs);
extern void siftup(size_t *heap, size_t lo, size_t hi, File *inputs);
extern void siftdown(size_t *heap, size_t lo, size_t hi, File *inputs);
extern int compare(File *inputs, size_t i1, size_t i2);
const char *arg0;
int main(int argc, char **argv)
{
    File inputs[argc];
    arg0 = argv[0];

    /* Open files for reading - load first lines */
    for (int i = 1; i < argc; i++)
    {
        File *current = &inputs[i];
        current->file = argv[i];
        if ((current->fp = fopen(current->file, "r")) == 0)
            err_exit("failed to open file %s for reading", current->file);
        current->reserve = 128;
        if ((current->line = malloc(current->reserve)) == 0)
            err_exit("failed to allocate %zu bytes memory", current->reserve);
        current->length = 0;
        read_line(current);
    }

#if defined(CHECK_FUNDAMENTALS)
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s\n", i, inputs[i].length, inputs[i].file);
#endif

    size_t heap_size = argc - 1;
    size_t heap[argc];  // heap[0] unused
    for (int i = 1; i < argc; i++)
        heap[i] = i;

#if defined(CHECK_FUNDAMENTALS)
    printf("Heap before:\n");
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif

    heapify(heap, heap_size, inputs);

#if defined(CHECK_FUNDAMENTALS)
    printf("Heap after:\n");
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif

#if defined(CHECK_FUNDAMENTALS)
    printf("Compare two lines:\n");
    printf("1: %s\n", inputs[1].line);
    printf("2: %s\n", inputs[2].line);
    int r12 = compare(inputs, 1, 2);
    int r21 = compare(inputs, 2, 1);
    int r11 = compare(inputs, 1, 1);
    printf("1 vs 2: %d\n", r12);
    printf("2 vs 1: %d\n", r21);
    printf("1 vs 1: %d\n", r11);
#endif

    /* Main merge loop: the smallest current line is always at heap[1] */
    while (heap_size > 0)
    {
        File *current = &inputs[heap[1]];
        write_line(current);
        read_line(current);
        if (current->line == 0)     /* File exhausted - remove it from the heap */
            heap[1] = heap[heap_size--];
        if (heap_size > 0)
        {
            siftdown(heap, 1, heap_size, inputs);
#if defined(CHECK_FUNDAMENTALS)
            printf("Heap check:\n");
            for (int i = 1; i < argc; i++)
                printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif
        }
    }

    return 0;
}
int compare(File *inputs, size_t i1, size_t i2)
{
    return strcmp(inputs[i1].line, inputs[i2].line);
}

void heapify(size_t *heap, size_t heap_size, File *inputs)
{
    for (size_t i = 1; i <= heap_size; i++)
        siftup(heap, 1, i, inputs);
}
/* J Bentley: More Programming Pearls
**
** Heap in array x, indexed from 1, not 0 as is normal in C.
** Implementation will allocate but not use array[0].
**
** function siftup(l, u, i, p) {
**     # pre maxheap(l, u-1)
**     # post maxheap(l, u)
**     i = u
**     while (1) {
**         # maxheap(l, u) except between i and its parent
**         if (i <= l) break
**         p = int(i/2)
**         if (x[p] >= x[i]) break
**         swap(p, i)
**         i = p
**     }
** }
**
** function siftdown(l, u, i, c) {
**     # pre maxheap(l+1, u)
**     # post maxheap(l, u)
**     i = l
**     while (1) {
**         # maxheap(l, u) except between i and its children
**         c = 2*i
**         if (c > u) break
**         if (c + 1 <= u && x[c+1] > x[c]) c++
**         if (x[i] >= x[c]) break
**         swap(c, i)
**         i = c
**     }
** }
*/
void siftup(size_t *heap, size_t lo, size_t hi, File *inputs)
{
    size_t i = hi;
    while (1)
    {
        if (i <= lo)
            break;
        size_t p = i / 2;
        if (compare(inputs, heap[p], heap[i]) <= 0)
            break;
        size_t t = heap[p];
        heap[p] = heap[i];
        heap[i] = t;
        i = p;
    }
}
void siftdown(size_t *heap, size_t lo, size_t hi, File *inputs)
{
    size_t i = lo;
    while (1)
    {
        size_t c = 2 * i;
        if (c > hi)
            break;
        if (c + 1 <= hi && compare(inputs, heap[c+1], heap[c]) < 0)
            c++;
        if (compare(inputs, heap[i], heap[c]) <= 0)
            break;
        size_t t = heap[c];
        heap[c] = heap[i];
        heap[i] = t;
        i = c;
    }
}
void read_line(File *current)
{
    char buffer[4096];
    if (fgets(buffer, sizeof(buffer), current->fp) != 0)
    {
        size_t length = strlen(buffer) + 1;
        if (length > current->reserve)
        {
            void *space = realloc(current->line, length);
            if (space == 0)
                err_exit("failed to reallocate %zu bytes memory", length);
            current->line = space;
            current->reserve = length;
        }
        strcpy(current->line, buffer);
        current->length = length - 1;
    }
    else
    {
        fclose(current->fp);
        current->fp = 0;
        free(current->line);
        current->line = 0;
        current->length = 0;
        current->reserve = 0;
    }
}
void write_line(File *current)
{
    if (fwrite(current->line, sizeof(char), current->length, stdout) != current->length)
        err_exit("short write of line from %s of length %zu", current->file, current->length);
}
void err_exit(const char *fmt, ...)
{
    int errnum = errno;
    va_list args;
    fprintf(stderr, "%s: ", arg0);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    if (errnum != 0)
        fprintf(stderr, " (%d: %s)", errnum, strerror(errnum));
    putc('\n', stderr);
    exit(EXIT_FAILURE);
}
The code is designed to handle lines up to 4 KiB long; it would not be hard to revise it to use POSIX getline() to handle even longer lines. It assumes that all the files can be open at once (which means an upper limit of around 250 input files on most machines without tweaking limits). It stops if it can't open all the files, rather than doing multiple merges to intermediate files.
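For instance, here's a minimal, untested sketch of read_line() rewritten around getline(); the line, reserve, and fp members map directly onto getline()'s arguments, and getline() grows the buffer itself via realloc():

void read_line(File *current)
{
    /* getline() returns the number of bytes read (including the newline,
    ** excluding the terminating NUL), or -1 on EOF/error - matching the
    ** length semantics of the fgets() version above.
    */
    ssize_t nread = getline(&current->line, &current->reserve, current->fp);
    if (nread != -1)
        current->length = (size_t)nread;
    else
    {
        fclose(current->fp);
        current->fp = 0;
        free(current->line);
        current->line = 0;
        current->length = 0;
        current->reserve = 0;
    }
}

Note that ssize_t comes from <sys/types.h>, and getline() may need a feature-test macro such as _POSIX_C_SOURCE on some systems.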