
This sorting code fails for very large input files because it takes too long to finish.

rewind(ptr);
j = 0;
while (fread(&temp, sizeof(temp), 1, ptr) == 1 && j != lines - 1)  //read object by object
{
    i = j + 1;
    while (fread(&temp1, sizeof(temp), 1, ptr) == 1)  //read the next object, to compare it with the previous one
    {
        if (temp.key > temp1.key)  //compare the key values of the two objects
        {
            temp2 = temp;  //to leave the records alone and swap only the keys, swap temp.key and temp1.key instead
            temp = temp1;
            temp1 = temp2;
            fseek(ptr, j * sizeof(temp), SEEK_SET);  //move the stream back to overwrite
            fwrite(&temp, sizeof(temp), 1, ptr);     //the swap above can be avoided by writing &temp1 here...
            fseek(ptr, i * sizeof(temp), SEEK_SET);  //move the stream back to overwrite
            fwrite(&temp1, sizeof(temp), 1, ptr);    //...and &temp here
        }
        i++;
    }
    j++;
    fseek(ptr, j * sizeof(temp), SEEK_SET);
}

Any idea how to make this C code much faster? Also, would using qsort() (from the standard C library) be much faster, and how should it be applied to the above code?

Mona Jalal
  • Of course it goes slow, you're going to the hard drive every 2 nanoseconds. Try reading more than one record at a time, even if you can't read the whole thing. – Chris Hayes Sep 17 '13 at 00:10
  • How? Any code snippet to illustrate your comment? – Mona Jalal Sep 17 '13 at 00:15
  • @MonaJalal Not a code snippet, but what do you think would happen if you read and wrote `1000*sizeof(temp)` bytes (for example) at a time, and did some of the sorting purely within RAM instead of on disk? – Dennis Meng Sep 17 '13 at 00:18
  • External sorting is too broad a subject for a Stack Overflow question. It is a subject for textbooks. – Eric Postpischil Sep 17 '13 at 00:41
  • Load all the data into RAM, sort it, and then write it back to the file. If you absolutely must do the sorting directly on the hard disk, at least use mergesort and try to do some buffering. Quicksort is also possible but can be very slow if used on data that is already sorted. – Tesseract Sep 17 '13 at 00:46
  • You should be using external-memory mergesort. See here for more details: http://en.wikipedia.org/wiki/External_sorting – leif Sep 17 '13 at 01:23
  • How big is your 'large input'? How much RAM do you have on your machine? How big is `sizeof(temp)`? Technically, you need a call to `fseek()` after the second `fwrite()`; you have to make a positioning operation when you change between reading and writing or vice versa. You don't absolutely have to swap `temp` and `temp1` in memory, but it isn't currently a source of your performance problems. – Jonathan Leffler Sep 17 '13 at 02:20
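
To make the positioning rule from the last comment concrete, here is a minimal standalone sketch (the file name and values are illustrative, not from the question): a stream opened in update mode needs a call to fflush() or a file-positioning function such as fseek() when switching between writing and reading.

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("demo.dat", "w+b");
    int value = 42, readback = 0;

    if (fp == NULL)
        return 1;
    fwrite(&value, sizeof value, 1, fp);  /* last operation was output... */
    fseek(fp, 0L, SEEK_SET);              /* ...so reposition before input */
    fread(&readback, sizeof readback, 1, fp);
    printf("read back %d\n", readback);   /* prints: read back 42 */
    fclose(fp);
    return 0;
}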

2 Answers


You asked the question Sorting based on key from a file and were given various answers about how to sort in memory. You added a supplemental question as an answer, and then created this question instead (which was correct).

Your code here is basically a disk-based bubble sort, with O(N²) complexity, and its performance is poor because every comparison and swap goes through the file buffers to disk. A bubble sort is a bad choice at the best of times: simple, yes, but slow.

The basic ways to speed up sorting programs are:

  1. If possible, read all the data into memory, sort in memory, and write the result out.
  2. If it won't all fit into memory, read as much into memory as possible, sort it, and write the sorted data to a temporary file (see the run-generation sketch after this list). Repeat as often as necessary to sort all the data. Then merge the temporary files into one file. If the data set is truly astronomical (or the memory truly minuscule), you may have to create intermediate merge files. These days, though, you have to be sorting many hundreds of gigabytes for that to be an issue at all, even on a 32-bit computer.
  3. Make sure you choose a good sorting algorithm. Quick sort with appropriate pivot selection is very good. You could look up 'introsort' too.
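
As a rough illustration of option 2, here is a minimal sketch of the run-generation phase: read as many fixed-size records as fit in a buffer, sort each chunk in memory with qsort(), and write it to its own temporary run file. The record layout, the cmp_records and make_sorted_runs names, and the runN.tmp naming scheme are assumptions for illustration only, and error checking is omitted.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int key; char data[12]; } record_type;  /* assumed record layout */

static int cmp_records(const void *x, const void *y)
{
    const record_type *a = x;
    const record_type *b = y;
    return (a->key > b->key) - (a->key < b->key);
}

/* Split ifname into sorted runs run0.tmp, run1.tmp, ...; returns the number of runs. */
int make_sorted_runs(const char *ifname, size_t chunk_recs)
{
    FILE *in = fopen(ifname, "rb");
    record_type *buf = malloc(chunk_recs * sizeof *buf);
    int nruns = 0;
    size_t n;

    while ((n = fread(buf, sizeof *buf, chunk_recs, in)) > 0)
    {
        char name[32];
        FILE *out;

        qsort(buf, n, sizeof *buf, cmp_records);  /* in-memory sort of this chunk */
        snprintf(name, sizeof name, "run%d.tmp", nruns++);
        out = fopen(name, "wb");
        fwrite(buf, sizeof *buf, n, out);
        fclose(out);
    }
    free(buf);
    fclose(in);
    return nruns;  /* the sorted runs still have to be merged into one file */
}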

You'll find example in-memory sorting code in the answers to the cross-referenced question (your original question). If you choose to write your own sort, you can consider whether to base the interface on the standard C qsort() function. If you write a Quick Sort, you should look at Quicksort — Choosing the pivot where the answers have copious references.

You'll find example merging code in the answer to Merging multiple sorted files into one file. The merging code outperforms the system sort program in its merge mode, which is intriguing since it is not highly polished code (but it is reasonably workmanlike).
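
For a feel of what that merge step involves, here is a minimal sketch of the simplest case, a two-way merge of two sorted run files. It reuses the record_type from the sketch above; the merge_two_runs name is made up for illustration, error checking is omitted, and a real external sort would generalize this to N input runs.

/* Merge two sorted run files into one sorted output file (two-way case). */
void merge_two_runs(const char *run1, const char *run2, const char *ofname)
{
    FILE *a = fopen(run1, "rb");
    FILE *b = fopen(run2, "rb");
    FILE *out = fopen(ofname, "wb");
    record_type ra, rb;
    int have_a = (fread(&ra, sizeof ra, 1, a) == 1);
    int have_b = (fread(&rb, sizeof rb, 1, b) == 1);

    while (have_a && have_b)
    {
        if (ra.key <= rb.key)  /* emit the smaller key, refill from that run */
        {
            fwrite(&ra, sizeof ra, 1, out);
            have_a = (fread(&ra, sizeof ra, 1, a) == 1);
        }
        else
        {
            fwrite(&rb, sizeof rb, 1, out);
            have_b = (fread(&rb, sizeof rb, 1, b) == 1);
        }
    }
    while (have_a)  /* drain whichever run still has records */
    {
        fwrite(&ra, sizeof ra, 1, out);
        have_a = (fread(&ra, sizeof ra, 1, a) == 1);
    }
    while (have_b)
    {
        fwrite(&rb, sizeof rb, 1, out);
        have_b = (fread(&rb, sizeof rb, 1, b) == 1);
    }
    fclose(a);
    fclose(b);
    fclose(out);
}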

You could look at the external sort program described in Software Tools, though it is a bit esoteric in that it is written in RatFor (Rational Fortran). The design, though, is readily transferable to other languages.

Jonathan Leffler

Yes, by all means, use qsort(). Use it either as SpiderPig suggests, by reading the whole file into memory, or as the in-memory sort for runs that do fit into memory when preparing for a merge sort. Don't worry about the worst-case performance. A decent implementation will take the median of (first, last, middle) as the pivot, giving fast sorting for the already-sorted and reverse-order pathological cases, plus better average performance in the random case.

This all-in-memory example shows you how to use qsort:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct record_tag
{
    int     key;
    char    data[12];

} record_type, *record_ptr;
typedef const record_type *record_cptr;

void create_file(const char *filename, int n)
{
    record_type buf;
    int i;
    FILE *fptr = fopen(filename, "wb");
    for (i=0; i<n; ++i)
    {
        buf.key = rand();
        snprintf(buf.data, sizeof buf.data, "%d", buf.key);
        fwrite(&buf, sizeof buf, 1, fptr);
    }
    fclose(fptr);
}

/* Key comparison function used by qsort(): */

int compare_records(const void *x, const void *y)
{
    record_cptr a = (record_cptr)x;
    record_cptr b = (record_cptr)y;
    return (a->key > b->key) - (a->key < b->key);
}

/* Read an input file of (record_type) records, sort by key field, and write to the output file */

void sort_file(const char *ifname, const char *ofname)
{
    const size_t MAXREC = 10000;
    int n;
    FILE    *ifile, *ofile;
    record_ptr buffer;

    ifile = fopen(ifname, "rb");
    buffer = (record_ptr) malloc(MAXREC*sizeof *buffer);
    n = fread(buffer, sizeof *buffer, MAXREC, ifile);
    fclose(ifile);

    qsort(buffer, n, sizeof *buffer, compare_records);

    ofile = fopen(ofname, "wb");
    fwrite(buffer, sizeof *buffer, n, ofile);
    fclose(ofile);
}

void show_file(const char *fname)
{
    record_type buf;
    int n = 0;
    FILE *fptr = fopen(fname, "rb");
    while (1 == fread(&buf, sizeof buf, 1, fptr))
    {
        printf("%9d : %-12s\n", buf.key, buf.data);
        ++n;
    }
    printf("%d records read", n);
}

int main(void)
{
    srand(time(NULL));

    create_file("test.dat", 99);
    sort_file("test.dat", "test.out");
    show_file("test.out");

    return 0;
}

Notice the compare_records function. The qsort() function needs a comparison function that accepts const void pointers, so those pointers must be cast to the correct type. Then the pattern:

(left > right) - (left < right)

...will return 1 if the left argument is greater, 0 if they are equal, or -1 if the right argument is greater.
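
As an aside that is not in the original answer: this pattern is also safer than the tempting return a->key - b->key, because that subtraction can overflow for int keys. A minimal demonstration:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int left = INT_MAX;
    int right = -1;

    /* left - right would overflow (undefined behavior for signed int),
       but the comparison pattern stays well defined: */
    printf("%d\n", (left > right) - (left < right));  /* prints 1 */
    return 0;
}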

The code could be improved. First, there is absolutely no error checking; that's not sensible in production code. Second, you could examine the input file to get the file size instead of guessing that it's less than some MAXxxx value. One way to do that is to use ftell(). (Follow the link for a file-size example.) Then use that value to allocate a single buffer, just big enough to qsort the data.
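
Here is a sketch of that sizing step, assuming the record_type and record_ptr definitions from the example above; the read_whole_file name is mine, and most error checking is still omitted.

/* Size the file first, then allocate exactly enough records to qsort(). */
record_ptr read_whole_file(const char *fname, size_t *count)
{
    FILE *fp = fopen(fname, "rb");
    long bytes;
    record_ptr buf;

    fseek(fp, 0L, SEEK_END);
    bytes = ftell(fp);         /* file size in bytes */
    rewind(fp);

    *count = (size_t)bytes / sizeof *buf;
    buf = (record_ptr)malloc(*count * sizeof *buf);
    if (buf != NULL)
        *count = fread(buf, sizeof *buf, *count, fp);
    fclose(fp);
    return buf;                /* NULL if the allocation failed */
}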

If there is not enough room (if the malloc returns NULL) then you can fall back on sorting chunks (with qsort, as in the snippet) that do fit into memory, writing them to separate temporary files, and then merging them into a single output file. That's more complicated, and rarely done since there are sort/merge utility programs designed specifically for sorting large files.

Mike Housky