How to add the tail of a line to a hash table

Question

I am trying to use this code to add lines of file to a hash table. Currently it reads a file of the format.

289016164,279211721,462102225
341714666,132189021,299107290
362328497,466836829,47952622

That is with three comma separated lines. I would like it to be able to read lines of the format

289016164,279211721,462102225, some random text
341714666,132189021,299107290, some more random text
362328497,466836829,47952622, even more random text

The struct that should hold each line should be

typedef struct Row {
    uint32_t a;
    uint32_t b;
    uint32_t t;
    char text[40];
} Row;

The routine that reads in the file is called readAll (see below) and I am having problems modifying it to do this.

How can I change readAll to be able to cope with this new format?

I have included most of the code that uses readAll to give some context.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Should be 37% occupied with 50m entries
#define TABLE_SIZE 0x8000000
#define MASK (TABLE_SIZE - 1)
#define BUFFER_SIZE 16384
#define END_OF_FILE (-1)
#define DEFAULT_VALUE (-1)

typedef struct Row {
    uint32_t a;
    uint32_t b;
    uint32_t t;
} Row;

int32_t hash(int32_t a) {
    return a * 428916315;
}

void insert(Row * table, Row row) {
    long loc = hash(row.a) & MASK; // Entries are hashed on a
    long inc = 0;
    while (inc <= TABLE_SIZE) {
        loc = (loc + inc) & MASK;
        inc++;
        if (table[loc].a == DEFAULT_VALUE) {
            table[loc] = row;
            break;
        }
    }
}

int readChar(FILE * input, char * buffer, int * pos, int * limit) {
    if (*limit < *pos) {
        return buffer[(*limit)++];
    } else {
        *limit = 0;
        *pos = fread(buffer, sizeof(char), BUFFER_SIZE, input);
        if (*limit < *pos) {
            return buffer[(*limit)++];
        } else
            return END_OF_FILE;
    }
}

void readAll(char * fileName, Row * table) {
    char* buffer = (char*) malloc(sizeof(char) * BUFFER_SIZE);
    int limit = 0;
    int pos = 0;

    FILE * input = fopen(fileName, "rb");

    int lastRead;
    Row currentRow;
    uint32_t * currentElement = &(currentRow.a);

    // We read rows with an FSM. We can
    // roll up some of the code using the `currentElement` pointer
    while (1) {
        switch(lastRead = readChar(input, buffer, &pos, &limit)) {
            case END_OF_FILE:
                fclose(input);
                return;
            case ',':
                if (currentElement == &(currentRow.a))
                    currentElement = &(currentRow.b);
                else
                    currentElement = &(currentRow.t);
                break;
            case '\n':
                insert(table, currentRow);
                currentRow.a = 0;
                currentRow.b = 0;
                currentRow.t = 0;
                currentElement = &(currentRow.a);
                break;
            default:
                *currentElement = *currentElement * 10 + (lastRead - '0');
                break;
        }
    }  //printf("Read %d", lastRead);
}

int main(int argc, char** argv) {
    Row* table = (Row*) malloc(sizeof(Row) * TABLE_SIZE);
    memset(table, 255, sizeof(Row) * TABLE_SIZE);

    readAll(argv[1], table);

    //[...]
}

You might want to change the function to use [`fgets`](http://en.cppreference.com/w/c/io/fgets) instead, and then use e.g. [`strtok`](http://en.cppreference.com/w/c/string/byte/strtok) to "tokenize" the input into its three numbers and the remainder of the line (which is the text). — Some programmer dude, Sep 05 '14 at 13:32
@JoachimPileborg The problem is speed. This method is being used as it is a lot faster and the file is GBs in size. — Simd, Sep 05 '14 at 13:33
Are you sure the code, as shown, is faster than the normal buffering that the standard I/O functions already have? It's using another layer of buffering on top of the existing buffering (which usually is tuned to fit the disk block size). — Some programmer dude, Sep 05 '14 at 13:35
@JoachimPileborg Pretty sure. It comes from http://codegolf.stackexchange.com/questions/26643/filter-a-large-file-quickly . The code as it is runs faster than `wc` which is pretty amazing. — Simd, Sep 05 '14 at 13:36
If you're certain, then you need to add another state to the `','` case so you know that you're done with the numbers and now read text. Then of course modify the `default` case so it knows about this text-reading state. And on newline, terminate the string. — Some programmer dude, Sep 05 '14 at 13:42
Oh and the code-golf author couldn't have been to knowledgeable about C, because in C you [should not cast the result of `malloc`](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc). — Some programmer dude, Sep 05 '14 at 13:43
@JoachimPileborg Thank you. I had the general idea but I just had problems getting that to work and was hoping for some help. I would be grateful for any improvements to the code. — Simd, Sep 05 '14 at 13:44
Some ways to add an extra state: If `currentElement` is `NULL` then you're reading the trailing text. And you need a new pointer (or array index) into the array where to store the next character. And you need to check for text longer than your array (39 characters plus terminator). And you might need to be able to handle the case where there is no extra text? — Some programmer dude, Sep 05 '14 at 13:48
One last thing, you have to try it yourself. You can't rely on code you just copy (and maybe not understand?) from sites on the Internet, then you won't be able to actually learn anything. The best (only?) way to learn is by *doing*, i.e. writing code yourself. — Some programmer dude, Sep 05 '14 at 13:50
Also, just because someone says "this code is fast", it might be fast for that persons use-case, but not for your, and also don't blindly trust it, but do measurements and profiling yourself. First write a "naive" version, try to optimize it, and if still not acceptable efficiency (and you might want to specify the requirements and use-case for us to know about it) then you can try to find more efficient code (just remember to measure it and compare it to what you have). — Some programmer dude, Sep 05 '14 at 13:52
@JoachimPileborg I have tested it. It is very very fast. It's easy to test, you should give it a go. I also have a slow version already. Thank you for the advice however. — Simd, Sep 05 '14 at 13:53
Oh and a very last last thing... ;) Even if there are more efficient or faster ways of doing something, those ways often have other drawbacks, the most important being readability/maintainability. A slower solution might still have acceptable speed, but is probably way more readable (and therefore more maintainable) than the one you have here. If you have a hard time understanding this now, think about how it will be in a few months if you need to come back and look at it again when it's not fresh in memory. — Some programmer dude, Sep 05 '14 at 13:58
@JoachimPileborg It would have been great if your final comment had been code :) — Simd, Sep 05 '14 at 14:00
Simply expand the `if()` block in `case ',':` to cope with the 3rd `,` to get and store the comment. — chux - Reinstate Monica, Sep 05 '14 at 15:19

John Hascall · Answer 1 · 2014-09-05T16:12:14.187

You need to recognize the third comma and fill in .text when you find it, something along these lines:

        case ',':
            if (currentElement == &(currentRow.a)) {
                currentElement = &(currentRow.b);
                break;
            }
            if (currentElement == &(currentRow.b)) {
                currentElement = &(currentRow.t);
                break;
            }
            {   int i = 0;
                int maxchars = sizeof(currentRow->text) - 1;

                while ((lastRead = readChar(input, buffer, &pos, &limit)) != '\n') {
                    if (i < maxchars) currentRow->text[i++] = lastRead;
                }
                currentRow->text[i] = '\0';
            }
            /* fallthrough*/

Suggest `i < sizeof(currentRow->text) - 1` – chux - Reinstate Monica Sep 05 '14 at 15:59 — chux - Reinstate Monica, Sep 05 '14 at 15:59

score 1 · Accepted Answer · answered Sep 05 '14 at 16:11

This will probably do what you want, which is considerably different than how I would have done any of this, but its your code, and I'm looking for a minimal solution.

First, define a macro for your text field length

#define TEXT_LEN    40

and use it in your structure:

typedef struct Row {
    uint32_t a;
    uint32_t b;
    uint32_t t;
    char text[TEXT_LEN];
} Row;

Next, modify your readAll to do this:

void readAll(char * fileName, Row * table)
{
    char* buffer = (char*) malloc(sizeof(char) * BUFFER_SIZE);
    int limit = 0;
    int pos = 0;

    FILE * input = fopen(fileName, "rb");

    int lastRead;
    Row currentRow;
    uint32_t * currentElement = &(currentRow.a);
    size_t txt_len = 0;

    while (1)
    {
        switch(lastRead = readChar(input, buffer, &pos, &limit))
        {
            case END_OF_FILE:
                fclose(input);
                return;

            case ',':
                // move from a to b
                if (currentElement == &(currentRow.a))
                    currentElement = &(currentRow.b);

                // move from b to t
                else if (currentElement == &(currentRow.b))
                    currentElement = &(currentRow.t);

                // move from t to NULL, begin trailing text
                else
                    currentElement = NULL;
                break;

            case '\n':
                // terminate text string
                currentRow.text[txt_len] = 0;
                insert(table, currentRow);
                currentRow.a = 0;
                currentRow.b = 0;
                currentRow.t = 0;
                txt_len = 0;
                currentElement = &(currentRow.a);
                break;

            default:
                // only if  there is a current element to parse as uint32_t
                if (currentElement)
                    *currentElement = *currentElement * 10 + (lastRead - '0');

                // else we're parsing trailing text
                else if (txt_len < (TEXT_LEN-1))
                    currentRow.text[txt_len++] = lastRead;

                // else we consume the char. as we have no space for it anyway
                break;
        }
    }
}

Notes:

Worth mentioning, your code will skip the last entry in the file if it is not terminated by a newline. Addressing that is not entirely trivial, in particular because of the double-buffering being done. The double buffering is a waste, and frequently done to avoid the overhead of the mandated locking characteristics of fgetc() and getc(). If only a single thread is reading the file, you can avoid this and significantly boost your performance by doing the following:

Open the file in read-only mode (you're doing this already)
Invoke flockfile(input) to lock the file for your thread.
Consume the file with your loop using getc_unlocked(input)
Upon reaching EOF, invoke funlockfile(input) and then fclose(input);

Doing the above will eliminate the need for readChar entirely, and reduce your code base significantly.

Best of luck.

Thank you very much. I will try this as soon as I can. I am interested that you say it could be made even faster. It is very fast currently (even faster than wc for example). It is certainly only one thread using it. How much of a speed up do you get? I also really like the idea of getting rid of readChar entirely! — Simd, Sep 05 '14 at 21:09
@user2179021 As I don't have your data files, I can't really comment on the speed difference; it is simply the way I would likely implement it rather than using trying to double-buffer, and in some platform cases (AS/400 for example) double buffering actually *hinders* performance, as you're actually robbing the system of memory it would otherwise use to efficiently buffer for you. You should try the above and see how it performs. — WhozCraig, Sep 05 '14 at 21:38

How to add the tail of a line to a hash table

2 Answers2