3

I need to sort stuff read from a file similar to the following:

Key: 2 rec:1 2 3 4 5 6 ...
Key: 3 rec:7 8 9 10 11 ...
Key: 1 rec:A B C D E F ...

becomes

Key: 1 rec:A B C D E F ...
Key: 2 rec:1 2 3 4 5 6 ...
Key: 3 rec:7 8 9 10 11 ...

and if we have something unsorted in rec (record ) of a key, it will remain unchanged! as sorting is based on the key. I want to use qsort() defined in C for sorting. I have an idea to use strtok for breaking each line read from file into manageable arrays but I am not if it's the best way to find the key number so as to sort them with qsort from C library.

P.S.: Each line of the input file includes one key like Key: 1 rec:A B C D E F ... Also we won't sort records within a key.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Mona Jalal
  • 34,860
  • 64
  • 239
  • 408
  • 1
    I suggest looking at the `sort` command in unix. Your question is a bit unclear but `sort -n yourfile > sortedfile` might do what you want (sort a file numerically) – John Carter Sep 16 '13 at 04:18
  • Are the key sections of equal length? If they were, you could write a compare function that just sorted the heads of each line based on strncmp() ( and jump past the "Key: " with pointer magic ). Are you trying to avoid loading all the lines into memory? – n0741337 Sep 16 '13 at 04:19
  • @n0741337 I don't necessarily think so. Here's one example of an input file to the sort program http://paste.ubuntu.com/6113022/ ! What other method can be used for extracting the value of key and sorting each line of the file based on its value? – Mona Jalal Sep 16 '13 at 04:23
  • @therefromhere If we have something like this `Key: 1 rec:A B C D E F` we are going to leave it unsorted! We are not going to sort everything! We are just going to replaces lines with other (aka sort them) by the key value. – Mona Jalal Sep 16 '13 at 04:26
  • 1
    Check out `sort -k2,2 -n` then. Otherwise, because they aren't fixed length sections, you're going to need to identify those first two spaces yourself. For that, you could use pointers instead of strtok()'ing for your compare function. – n0741337 Sep 16 '13 at 04:36
  • 1
    If your job is to get the data sorted, use `sort -k2,2n` or `sort -k2,2 -n` and you're home. If your job is to write C code, you've got some work ahead of you...is the data set small enough to fit in memory, or do you have to worry about external sorting (reading as much data as will fit in memory, sorting it, writing it to a temporary file, repeat until all data has been read, then merge all the files)? If you can fit it all in memory, it isn't too painful. The biggest numbers fit into a 32-bit `int` (a signed one, at that, unless I misread the data). _[...continued...]_ – Jonathan Leffler Sep 16 '13 at 04:42
  • _[...continuation...]_ To process the data, read the lines; in your data structure, store the converted key number (as well as the original line data). Sort on the key number; write the data. That's all pretty straight-forward stuff. Your lines are fairly long; you can use `fgets()` with a large buffer (say 4 KiB) and then use `strdup()` to replicate just the amount that was used. Or you can use POSIX [`getline()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html) which will allocate the memory for you. – Jonathan Leffler Sep 16 '13 at 04:44
  • @JonathanLeffler We suppose the data is 32 bit int so hopefully we don't need to deal with external sorting! – Mona Jalal Sep 16 '13 at 04:46
  • The disk-based bubble sort that was added to the question by a moderator is now the subject of a separate question: [How to make sorting in C programs faster for large input sets?](http://stackoverflow.com/questions/18839308/how-to-make-sorting-in-c-programs-pretty-much-faster-for-the-large-input-sets) Consequently, I'm going to rollback the 'supplemental'. – Jonathan Leffler Sep 17 '13 at 04:04

3 Answers3

3

To do this in c, make use of sscanf and you can get a sort of regex to extract the integer needed:

int comp(const void *str1, const void *str2) {
    char *a = *(char **)str1, *b = *(char **)str2;
    int key1, key2;
    sscanf(a, "%*s%d", &key1);
    sscanf(b, "%*s%d", &key2);
    return key1-key2;
}

//Call the function qsort like so
qsort(/*char **/lines, /*int*/numElements, /*unsigned*/ sizeof (char*), comp);

Don't know how to use the regex library in c++, but sscanf still works. Full working example in c++11:

#include <iostream>
#include <cstdio>
#include <deque>
#include <string>
#include <algorithm>

int main() {

    //Using fstream, read in each line of the file into a string using getline(...)
    std::deque<std::string> lines = {
        "Key: 2 rec:1 2 3 4 5 6",
        "Key: 3 rec:7 8 9 10 11",
        "Key: 1 rec:A B C D E F",
        "Key: 4 rec:1 2 3 4 5 6"
    }; //Store each in a deque object

    //using std::sort
    std::sort(lines.begin(), lines.end(), []( const std::string &str1, const std::string &str2 ) {
        int key1, key2;
        sscanf(str1.c_str(), "%*s%d", &key1);
        sscanf(str2.c_str(), "%*s%d", &key2);
        return (key1 < key2);
    });


    for (auto sortedkeys: lines)
        std::cout << sortedkeys << "\n";
    return 0;
}
smac89
  • 39,374
  • 15
  • 132
  • 179
  • what is the equivalent of this program in C ? I am not supposed to use C++. Also I know how to make use of qsort and how to write the compare function in C. I mainly need to know how to get to the number exactly after key: and how to sort all the lines based on the keys. thanks a ton. – Mona Jalal Sep 16 '13 at 05:41
  • I have showed you how the compare function should look like, so all you have to do now is fill an array with all the lines in the file you are reading from and have a count of the number of lines read; then call qsort using the way I showed. Should be easy enough – smac89 Sep 16 '13 at 05:44
1

IF the key lengths are not same you should avoid usage of strncmp and read line by line and then get key value by using loop from line[5] to next space(or else use strtok with delimiter of space).

Repeat this till EOF. store key values in array or list.

Next sort array or list.

Now find the value of Key from sorted array in your file by using strstr and copy matched line into new file. before using strstr convert key into string.

if you want to avoid coping into new files need to move file pointer between the lines using fseek and Modify lines.

Gangadhar
  • 10,248
  • 3
  • 31
  • 50
  • I am agree with your answer that we should find line[5] because in "key: 1804289383" 180...starts from line[5]. How to you take out lines from a file and put them into string? do we have any such get_line predefined function in C? Also I think if I use line[5] method, I kind of don't need to use strtok basically, right? – Mona Jalal Sep 16 '13 at 04:51
  • Define MAX_LINE. #define `MAX_LINE maximum length of line+1` . Declare a string `char line[MAX_LINE]`.first open file in r+ mode with the help of `fopen()` ex: `fp=fopen("file","r+")` and then use `fgets()` ex:`fgets(line,MAX_LINE,fp)`to read the line from file. the string line is used. and you need not to use strtok if use line[5]to next space. – Gangadhar Sep 16 '13 at 04:58
1

If you must write C, it needn't be all that long or complicated. You could simplify it more than this if you skimp on the error checking.

#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern void err_exit(const char *fmt, ...);

typedef struct data
{
    char *line;
    int   key;
} data;

static int cmp_data(const void *v1, const void *v2)
{
    const data *d1 = v1;
    const data *d2 = v2;
    if (d1->key < d2->key)
        return -1;
    else if (d1->key > d2->key)
        return +1;
    else
        return 0;
}

int main(void)
{
    char buffer[4096];
    data *array = 0;
    size_t array_len = 0;
    size_t array_max = 0;

    while (fgets(buffer, sizeof(buffer), stdin) != 0)
    {
        if (array_len >= array_max)
        {
            size_t new_size = (array_max + 2) * 2;
            void *space = realloc(array, new_size * sizeof(data));
            if (space == 0)
                err_exit("Out of memory (1)");
            array = space;
            array_max = new_size;
        }
        array[array_len].line = strdup(buffer);
        if (array[array_len].line == 0)
            err_exit("Out of memory (2)");
        if (sscanf(array[array_len].line, "%*s %d", &array[array_len].key) != 1)
            err_exit("Format error - no number in right place in: %.20s...\n",
                     array[array_len].line);
        //printf("%3zu:%.10d: %s", array_len, array[array_len].key,
        //       array[array_len].line);
        array_len++;
    }

    qsort(array, array_len, sizeof(data), cmp_data);

    for (size_t i = 0; i < array_len; i++)
        fputs(array[i].line, stdout);

    return 0;
}

void err_exit(const char *fmt, ...)
{
    int errnum = errno;
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    if (errnum != 0)
        fprintf(stderr, " (%d: %s)", errnum, strerror(errnum));
    putc('\n', stderr);
    exit(EXIT_FAILURE);
}

keysort — overwriting files as they are sorted

#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void sort_file(const char *i_file, const char *o_file);

int main(int argc, char **argv)
{
    if (argc > 1)
    {
        for (int i = 1; i < argc; i++)
            sort_file(argv[i], argv[i]);
    }
    else
        sort_file("/dev/stdin", "/dev/stdout");
    return 0;
}

typedef struct data
{
    char *line;
    int   key;
} data;

static int cmp_data(const void *v1, const void *v2)
{
    const data *d1 = v1;
    const data *d2 = v2;
    if (d1->key < d2->key)
        return -1;
    else if (d1->key > d2->key)
        return +1;
    else
        return 0;
}

static void err_exit(const char *fmt, ...)
{
    int errnum = errno;
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    if (errnum != 0)
        fprintf(stderr, " (%d: %s)", errnum, strerror(errnum));
    putc('\n', stderr);
    exit(EXIT_FAILURE);
}

void sort_file(const char *i_file, const char *o_file)
{
    char buffer[4096];
    data *array = 0;
    size_t array_len = 0;
    size_t array_max = 0;

    FILE *i_fp = fopen(i_file, "r");
    if (i_fp == 0)
        err_exit("Failed to open file %s for reading", i_file);

    while (fgets(buffer, sizeof(buffer), i_fp) != 0)
    {
        if (array_len >= array_max)
        {
            size_t new_size = (array_max + 2) * 2;
            void *space = realloc(array, new_size * sizeof(data));
            if (space == 0)
                err_exit("Out of memory (1)");
            array = space;
            array_max = new_size;
        }
        array[array_len].line = strdup(buffer);
        if (array[array_len].line == 0)
            err_exit("Out of memory (2)");
        if (sscanf(array[array_len].line, "%*s %d", &array[array_len].key) != 1)
            err_exit("Format error - no number in right place in: %.20s...\n",
                     array[array_len].line);
        //printf("%3zu:%.10d: %s", array_len, array[array_len].key,
        //       array[array_len].line);
        array_len++;
    }
    fclose(i_fp);

    qsort(array, array_len, sizeof(data), cmp_data);

    FILE *o_fp = fopen(o_file, "w");
    if (o_fp == 0)
        err_exit("Failed to open file %s for writing", o_file);
    for (size_t i = 0; i < array_len; i++)
        fputs(array[i].line, o_fp);
    fclose(o_fp);
}

If your system doesn't support /dev/stdin and /dev/stdout, then you have to modify the interface to sort_file(), probably to:

void sort_file(const char *i_file, FILE *ifp, const char *o_file, FILE *ofp);

You then decide that if ifp is not null, you use it for input — otherwise you open the file specified by i_file. Similarly for output: if ofp is not null, you use it — otherwise, you open the file specified by o_file. The changes to main() and in the body of sort_file() are trivial.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • In the above code where do you deal with FILE pointer and filename? This sorting needs to read data from file and then do some intermediate calculation and then again write back to the file. Please let me know if you know of a version (or can modify yours) in a way to deal with files in C as well. Thanks. – Mona Jalal Sep 16 '13 at 19:36
  • The code reads from `stdin` (see the `fgets()` line) and writes to `stdout` (see the `fputs()` line). If you want to read from files, you have to arrange to wrap the code in a suitable loop that opens the named files, reads from them, and closes them. If you want to overwrite the original input files (are you sure?) then you will open the file for writing (and write, and close it). Routine changes...not very hard (10 minutes work, perhaps, on the code, with some testing). Mainly needs some extra data to work with — testing one file several times over is ... not very informative. – Jonathan Leffler Sep 16 '13 at 20:16
  • @JonathanLeffler, why not use [freopen](http://www.tutorialspoint.com/c_standard_library/c_function_freopen.htm)? That way, you just have to use printf and scanf instead of worrying about all those FILE* pointer objects – smac89 Sep 16 '13 at 21:09
  • @Smac89: I've never bothered; I don't like losing standard input and standard output, and writing code to use given file pointers is normal to me. Each to their own. I guess it would work (I can see no reason why it wouldn't work), but it isn't the way I've ever done it, and I'm not sure I'll change now. – Jonathan Leffler Sep 16 '13 at 21:26