Text file processing speed

Question

So I have a text file containing about 666 000 lines of maximum 10 numbers/line separated by a space. Example:

8 38 62 39 4 50 86 43
53 78 38 22 39 29 78 5
24 13 58 92
.......
53 78 38 22 39 29 78 5

Given a sequence of n numbers, I have to verify if a line has all its elements from the sequence.

I have tried something like this:

int check()
{
    int nrelem = 0, nr, line_int[11];
    int found_counter = 0, sw = 0;
    char *p;

    f = fopen("temp.txt", "rt");

    while (!feof(f))
    {
        nrelem = 0; found_counter = 0; sw = 0;

        fgets(line, 256, f);

        p = strtok(line, " ");
        while (p != NULL)
        {
            sscanf(p, "%d", &line_int[nrelem++]);
            p = strtok(NULL, " ");
        }

        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < nrelem; j++)
            {
                if (seq[i] == line_int[j])
                {
                    sw = 1;
                    break;
                }
            }
            if (sw)
                found_counter++;
        }
        if (found_counter == nrelem)
            return 0;
    }
    fclose(f);
    return 1;
}

The problem is that the running time for this function, at 600 000 lines / file is about 14 seconds. I guess it is the way I get my elements from each line of the file with strtok and the implementation with files. Do you guys know a better approach to this, that can reduce the running time below 1 second without the need of a quantum computer? :D Thank you in advance.

"I guess" --- performance optimisation should be based on measurements, not assumptions. Profile it, then optimise. — zerkms, Oct 25 '17 at 21:51
Done that already, can't get below 1 second with this approach. Can't even get below 10 seconds. That's why I need help. @zerkms — MrB, Oct 25 '17 at 21:57
"Done that already," --- so what exactly did the profiling tell you? What is the bottleneck? — zerkms, Oct 25 '17 at 21:57
fgets alone takes 1,6 seconds / text file. The while with strtok takes another 10 seconds/text file. @zerkms — MrB, Oct 25 '17 at 22:00
What about reading the entire file to memory using `fread` and then do the text processing in memory? — seladb, Oct 25 '17 at 22:02
If your sequence has a relatively low range (say, the only possible numbers are from `0` to `99`), you can implement a very basic hashmap that reduces your innermost `for` loop to a single `if` statement, making the algorithm `n` times faster. — Patrick Roberts, Oct 25 '17 at 22:08
yes, that is the reason i need to reduce the runtime of this function. because i call it for 200 000 sequences to validate them. @xing — MrB, Oct 25 '17 at 22:09
Please see [Why is “while ( !feof (file) )” always wrong?](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong) It is better to control the loop with `while(fgets(line, 256, f) != NULL)`. And add the consequent final newline read to the `strtok` delimiters, say `" \n"`. — Weather Vane, Oct 25 '17 at 22:10
will take a look at that. but the BIG problem is here: p = strtok(line, " "); while (p != NULL) { sscanf(p, "%d", &line_int[nrelem++]); p = strtok(NULL, " "); } THIS TAKES 10 SECONDS. here I try to convert each line of characters from string to an array of ints. How to solve this ? THIS IS THE BIG ISSUE :D Can this be reduced below 1 second somehow, for the entire text file ? — MrB, Oct 25 '17 at 22:12
:)) already changed that, thank you! :D I am willing to learn, that's why I'm here @WeatherVane — MrB, Oct 25 '17 at 22:17
I don't see your declaration for `line` anywhere. Using local variables instead of global variables might also speed up your algorithm, if even by a small amount. — Patrick Roberts, Oct 25 '17 at 22:24
Does the sequence of numbers in the line have to match the sequence you have, or do you have to ensure that regardless of the order of the elements in the line, the numbers are the same? Can you have repeats of a number in a single line? Eyeballing your data shows no repeats. Are the numbers limited to 1..99 or can they be bigger or smaller than that? Have you considered: `int num_values = sscanf(line, "%d%d%d%d%d%d%d%d%d%d", &line_int[0], &line_int[1], &line_int[2], &line_int[3], &line_int[4], &line_int[5], &line_int[6], &line_int[7], &line_int[8], &line_int[9]);`? _[...continued...]_ — Jonathan Leffler, Oct 25 '17 at 22:58
_[...continuation...]_ That reads up to 10 numbers on a line in one call to `sscanf()`. You know how many numbers were read. You can immediately go into 'does the list found match the required list' processing mode. It won't be slower than what you've got; it might be quicker. It also doesn't mangle the input line, unlike using `strtok()`. Also, before going into the quadratic search the lists loop, you should check that the sequence from the input line is at least as long as the sequence you need to match; otherwise, it can't match, but you expend time proving that. — Jonathan Leffler, Oct 25 '17 at 23:00
Would there be any merit in sorting the sequence to be matched (once, before reading any data). Then you could sort the sequence from the input line (an O(NlogN) operation) and then compare the sorted sequences. That's an O(N) operation, giving you an O(NlogN) overall complexity, instead of the O(N*N) operation your nested loop gives you. However, the timing constants matter on small set (10 elements) whereas they're not important with asymptotic work (hundreds or thousands of numbers per line); the sort might just slow things down. — Jonathan Leffler, Oct 25 '17 at 23:07
Does the program exit once it has found the first line where all numbers are whitelisted? If so, to process the complete file, every line must contain some numbers not from the list, is this true? Can this file be downloaded somewhere? — A.K., Oct 26 '17 at 00:50

Patrick Roberts · Answer 1 · 2017-11-15T19:12:21.067

0

As I suggested in comments, using a hash map in your algorithm would lower the time complexity in checking if a number is in the sequence seq. I also changed the signature of the function to something that made a little more sense, as well as incorporating some suggestions from other comments.

Note that this function has not been tested, other than compiling without errors. It requires <stdbool.h>, <stdio.h>, <stdlib.h>, and <string.h> in order to compile:

bool check(char *filename, int *seq, size_t seqSize) {
    size_t lookupSize = 100;
    char lookup[lookupSize];
    FILE *fp;
    char *line = NULL;
    size_t len = 0;
    long int num;
    bool result = true;

    memset(hash, 0, lookupSize);

    for (size_t index = 0; index < seqSize; index++) {
        hash[seq[index]] = 1;
    }

    fp = fopen(filename, "r");

    if (fp != NULL) {
        while (getline(&line, &len, fp) != -1 && result) {
            char *cp = line;
            result = false;

            while (*cp != '\0' && !result) {
                num = strtol(cp, &cp, 10);
                result = num < lookupSize && !lookup[num];
            }
        }

        fclose(fp);
        free(line);
    }

    return result;
}

I set lookupSize to 100, since it appears that your numbers are in the range of 0 to 99. If you have larger numbers, you'll need to declare an array for your lookup table with a size equal to one more than the max integer value in the range you're using, which is why I only recommended this approach for a relatively small range.

edited Nov 15 '17 at 19:12

answered Oct 26 '17 at 00:09

Patrick Roberts

49,224
10
102
153

@MooingDuck [that's intentional](https://stackoverflow.com/a/3501681/1541563). If `*lineptr` is `NULL`, then `getline()` will allocate a buffer for storing the line, which should be freed by the user program. Straight from [the manual](https://linux.die.net/man/3/getline). – Patrick Roberts Oct 26 '17 at 00:14
Whoever's downvoted this (or anyone else), would you mind explaining why? As far as I can tell, this is a perfectly valid answer, and it's touched on all the suggestions made in comments, as well as modifying the suspect bottleneck portion of the code profiled by OP. I'm happy to improve my answer further if there's an issue with it. – Patrick Roberts Oct 26 '17 at 01:04
Have a look at `strtol` -- it does everything `atoi` does and more, handles errors better, and returns the separator between items so you won't need the call to `strchr` anymore either. BTW, that wouldn't normally be considered a hash table either, formally speaking it is one, but using an identity function as the hash. – Ben Voigt Nov 15 '17 at 15:49
@BenVoigt thanks for the suggestion! When I have some more time to look this over, I'll work that into my solution. Also, I suppose calling it a sparse array or something would be more accurate, is that what you're saying? – Patrick Roberts Nov 15 '17 at 16:05
1

"Array" would be accurate, as would "bitmap", "bitset", "lookup table", etc. It's just the word "hash" that is misleading, since there is no hashing function here. – Ben Voigt Nov 15 '17 at 18:59

Text file processing speed

1 Answers1