
I am working on a machine learning application where my features are stored in huge text files. The way I have implemented the data input is currently far too slow to be practical. Each line of the text file represents a feature vector in sparse format. For instance, the following example contains three feature vectors in index:value format.

1:0.34 2:0.67 6:0.99 12:2.1 28:2.1
2:0.12 22:0.27 26:9.8 69:1.8
3:0.24 4:67.0 7:1.9 13:8.1 18:1.7 32:3.4

Following is how I am doing the reads now. As I don't know the length of the feature string beforehand, I just read a suitably large length that upper-bounds the length of each string. Once I have read a line from the file, I use the strtok_r function to split the string into key-value pairs and then process it further to store it as a sparse array. Any ideas on how to speed this up are highly appreciated.

FILE *fp = fopen(feature_file, "r");

int fvec_length = 0;
char line[1000000];
size_t ln;
char *pair, *single, *brkt, *brkb;

SVECTOR **fvecs = (SVECTOR **)malloc(n_fvecs*sizeof(SVECTOR *));
if(!fvecs) die("Memory Error.");

int i = 0, j = 0;
WORD *words = NULL;

while( fgets(line,1000000,fp) ) {
    ln = strlen(line) - 1;
    if (line[ln] == '\n')
        line[ln] = '\0';

    fvec_length = 0;
    for(pair = strtok_r(line, " ", &brkt); pair; pair = strtok_r(NULL, " ", &brkt)){
        fvec_length++;
        words = (WORD *) realloc(words, fvec_length*sizeof(WORD));
        if(!words) die("Memory error.");
        j = 0;
        for (single = strtok_r(pair, ":", &brkb); single; single = strtok_r(NULL, ":", &brkb)){
            if(j == 0){
                words[fvec_length-1].wnum = atoi(single);
            }
            else{
                words[fvec_length-1].weight = atof(single); 
            }
            j++;
        }
    }   
    fvec_length++; 
    words = (WORD *) realloc(words, fvec_length*sizeof(WORD));
    if(!words) die("Memory error.");
    words[fvec_length-1].wnum = 0;
    words[fvec_length-1].weight = 0.0;

    fvecs[i] = create_svector(words,"",1);
    i++;
    free(words);
    words = NULL;
}
fclose(fp);
return fvecs;
stressed_geek
  • replace `ln = strlen(line) - 1; if (line[ln] == '\n') line[ln] = '\0';` with simply `if (line[strlen(line) - 1] == '\n') line[strlen(line) - 1] = '\0';` – Ariel Pinchover Apr 18 '13 at 07:34
  • Did you try profiling it? – Johan Kotlinski Apr 18 '13 at 07:35
  • the `if (j == 0)` part is ridiculous, it's 0 only the first time. Do it once outside of the loop and then you can do without the if/else – Ariel Pinchover Apr 18 '13 at 07:35
  • `if (j == 0)` is guaranteed to take almost no time at all compared to all the disk I/O and mallocs – Johan Kotlinski Apr 18 '13 at 07:36
  • the second, or the first, `words = (WORD *) realloc(words, fvec_length*sizeof(WORD));` is pointless. In any case you lose the one before. realloc, from what I remember, doesn't free the place you realloc from. – Ariel Pinchover Apr 18 '13 at 07:39
  • @kotlinski you're right, but even so, it's still better than nothing, and still ridiculous. And besides, just because he has worse parts doesn't mean the less bad parts should be neglected – Ariel Pinchover Apr 18 '13 at 07:40
  • @Infested The `realloc` is used to extend the size of the words array dynamically. – stressed_geek Apr 18 '13 at 07:44
  • @kotlinski No I haven't tried profiling. I will look it up now. Meanwhile, any other suggestions are welcome. – stressed_geek Apr 18 '13 at 07:45
  • What performance do you get (MB/s) and what performance do you expect/need? The file read part looks fine, but I'm a bit worried about the frequent reallocs. Increasing the size in chunks may increase performance (provided realloc is the time hog). – Klas Lindbäck Apr 18 '13 at 07:50
  • I don't understand the way fvec_length is used, why do you increment it each time? – Ariel Pinchover Apr 18 '13 at 07:50
  • If you know each line has exactly 5 elements, you can easily use `fscanf()` for parsing directly into the destination address. Could still use it or `sscanf` otherwise, but trickier. Also, try reallocating less often - e.g. allocate space for 100 WORDs to start with, then increase by a multiplier you're comfortable with (e.g. 1.1, 1.5, 2) to tune memory efficiency vs. probable # of resizes. (If you want ultimate speed, consider memory mapping the input file.) – Tony Delroy Apr 18 '13 at 07:55
  • @stressed_geek have you tried using dup2? – Ariel Pinchover Apr 18 '13 at 08:06
  • @KlasLindbäck I think I am getting read performance of around 20-25 MB/s. I would like it to be at least 5 times that. – stressed_geek Apr 18 '13 at 08:12
  • I think the major suggestion is to go easy on the `realloc`. I will certainly try it now and let you guys know if that helps speed up things. – stressed_geek Apr 18 '13 at 08:13
  • @TonyD The number of elements in each line varies; I have edited my question to indicate the same. Thanks for your other suggestion. – stressed_geek Apr 18 '13 at 08:17
  • Unrelated, but still: you shouldn't [cast the return value of `malloc()` and friends, in C](http://stackoverflow.com/a/605858/28169). – unwind Apr 18 '13 at 08:41
  • Also unrelated, but don't write `realloc` like `x = realloc(x, ...)`. You wouldn't be able to check against failures and you would leak huge memory in such cases. – Shahbaz Apr 18 '13 at 09:34

3 Answers

  1. You should absolutely reduce the number of memory allocations. The classic approach is to double the vector on each allocation, so that you get a logarithmic number of allocation calls rather than a linear one (see the sketch after this list).

  2. Since your line pattern seems constant, there's no need to tokenize it by hand; use a single sscanf() on each loaded line to scan directly into that line's words.

  3. Your line buffer seems extremely large; this can hurt by blowing up the stack and worsening cache locality a bit.
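For illustration, here is a minimal sketch combining points 1 and 2, assuming the WORD type, the die() helper and the includes from the question (the parse_line name is made up, and %n is used so a single sscanf() format can cope with lines holding a varying number of pairs):

/* Hypothetical helper: parse one line into a growing WORD array.
 * The capacity is doubled (point 1) and sscanf() with %n replaces the
 * hand tokenizing (point 2), adapted to lines of varying length. */
WORD *parse_line(const char *line, int *n_words)
{
    int capacity = 64;                    /* generous starting size */
    int len = 0;
    WORD *words = malloc(capacity * sizeof(WORD));
    if (!words) die("Memory error.");

    int wnum, consumed;
    double weight;
    const char *p = line;
    /* %n records how many characters were consumed, so p can be advanced. */
    while (sscanf(p, " %d:%lf%n", &wnum, &weight, &consumed) == 2) {
        if (len + 1 >= capacity) {        /* +1 keeps room for the terminator */
            capacity *= 2;                /* geometric growth: O(log n) reallocs */
            WORD *tmp = realloc(words, capacity * sizeof(WORD));
            if (!tmp) die("Memory error.");
            words = tmp;
        }
        words[len].wnum = wnum;
        words[len].weight = weight;
        len++;
        p += consumed;
    }
    words[len].wnum = 0;                  /* terminating pair, as in the question */
    words[len].weight = 0.0;
    *n_words = len + 1;
    return words;
}

The caller would pass each line read by fgets() (or getline()) and hand the returned array to create_svector() before freeing it, just as in the original loop.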

unwind
  • Thanks! I tried the 3 suggestions you gave. Regarding the first point, it did speed up the reads by around 50%, which is pretty cool. `sscanf()` however, for some strange reason, seems to slow down the reads by an order of magnitude, so I am sticking to `strtok()`. Regarding the third suggestion, I switched to using `getline()`. It doesn't provide any performance improvement, however, the code is much neater. Would be grateful if you have any other suggestions! – stressed_geek Apr 18 '13 at 21:21

Probably when you call realloc, you are making a system call. A system call is an expensive operation that involves a context switch from user to kernel space and back.

It seems that you are doing a realloc call for every pair of tokens you parse. That is a lot of calls. You didn't mind allocating a 1 MByte buffer up front for the line read from the file. Why are you so conservative about the buffer pointed to by words?
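As a rough sketch of that idea, reusing the variables from the question's code (fp, line, fvecs, i) and keeping the strtok_r-based outer loop: the words buffer is allocated generously once, grown only when a line actually needs more room, and freed once at the end.

size_t words_cap = 4096;                  /* be as generous as with the line buffer */
WORD *words = malloc(words_cap * sizeof(WORD));
if (!words) die("Memory error.");

while (fgets(line, 1000000, fp)) {
    size_t used = 0;
    char *brkt;
    for (char *pair = strtok_r(line, " \n", &brkt); pair;
         pair = strtok_r(NULL, " \n", &brkt)) {
        if (used + 2 > words_cap) {       /* grow only on the rare long line */
            words_cap *= 2;
            WORD *tmp = realloc(words, words_cap * sizeof(WORD));
            if (!tmp) die("Memory error.");
            words = tmp;
        }
        char *colon = strchr(pair, ':');  /* split index:value without a second strtok_r */
        if (!colon) continue;
        *colon = '\0';
        words[used].wnum = atoi(pair);
        words[used].weight = atof(colon + 1);
        used++;
    }
    words[used].wnum = 0;                 /* terminating pair, as in the question */
    words[used].weight = 0.0;
    fvecs[i] = create_svector(words, "", 1);
    i++;
}

free(words);                              /* one free, after the whole loop */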

Amadeus
  • Thanks, I am going to try out your suggestion to go easy on the `realloc`. Moreover, do you think that the 1 MByte buffer used for the line read from the file might also be slowing things down? Should I look into smarter ways of doing that? – stressed_geek Apr 18 '13 at 08:15

I find that on Linux (Fedora), realloc() is extremely efficient and does not particularly slow things down. On Windows, because of how the memory is structured, it can be disastrous.

My solution to the "lines of unknown length" problem is to write a function which makes multiple calls to fgets(), concatenating the results until a newline character is detected. The function accepts &maxlinelength as an argument, and if any call to fgets() would cause the concatenated string to exceed maxlinelength, then maxlinelength is adjusted. This way, new memory is reallocated only until the longest line has been found. Similarly, you would only need to realloc() for WORD if maxlinelength had been adjusted.
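A rough sketch of such a reader, using only the standard library plus the die() helper from the question (the function name, return type and growth factor are illustrative); on POSIX systems getline() does essentially the same job, which matches what the asker switched to in the comments above:

/* Read one complete line of arbitrary length into *buf, growing the buffer
 * (and *maxlinelength) only when a longer line than any seen so far appears.
 * Returns the line length, or -1 at end of file. */
long read_full_line(char **buf, size_t *maxlinelength, FILE *fp)
{
    size_t len = 0;

    if (*buf == NULL) {
        *maxlinelength = 1024;            /* initial guess, doubled as needed */
        *buf = malloc(*maxlinelength);
        if (!*buf) die("Memory error.");
    }
    for (;;) {
        /* Append the next chunk into the unused tail of the buffer. */
        if (!fgets(*buf + len, (int)(*maxlinelength - len), fp))
            return len ? (long)len : -1;
        len += strlen(*buf + len);
        if (len > 0 && (*buf)[len - 1] == '\n') {
            (*buf)[--len] = '\0';         /* newline found: the line is complete */
            return (long)len;
        }
        /* No newline yet: the line did not fit, so enlarge the buffer. */
        *maxlinelength *= 2;
        char *tmp = realloc(*buf, *maxlinelength);
        if (!tmp) die("Memory error.");
        *buf = tmp;
    }
}

Called in a loop, *buf and *maxlinelength keep their values between lines, so reallocation stops once the longest line in the file has been seen.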

JWDN
  • Most of the time, it's the _standard library_ that is handling `realloc` calls. It wouldn't be Linux vs. Windows, but probably glibc vs. Visual C's library. If you try with gcc and glibc on Windows, you probably wouldn't see any difference in efficiency compared with Linux. – Shahbaz Apr 18 '13 at 09:31
  • Thanks - I've heard the two OSes manage memory differently, but you may be right - in any case, we digress! – JWDN Apr 18 '13 at 09:50