5

I am writing some code that needs to read fasta files, so part of my code (included below) is a fasta parser. As a single sequence can span multiple lines in the fasta format, I need to concatenate multiple successive lines read from the file into a single string. I do this, by realloc'ing the string buffer after reading every line, to be the current length of the sequence plus the length of the line read in. I do some other stuff, like stripping white space etc. All goes well for the first sequence, but fasta files can contain multiple sequences. So similarly, I have a dynamic array of structs with a two strings (title, and actual sequence), being "char *". Again, as I encounter a new title (introduced by a line beginning with '>') I increment the number of sequences, and realloc the sequence list buffer. The realloc segfaults on allocating space for the second sequence with

*** glibc detected *** ./stackoverflow: malloc(): memory corruption: 0x09fd9210 ***
Aborted

For the life of me I can't see why. I've run it through gdb and everything seems to be working (i.e. everything is initialised, the values seems sane)... Here's the code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#include <math.h>
#include <errno.h>

//a struture to keep a record of sequences read in from file, and their titles
typedef struct {
    char *title;
    char *sequence;
} sequence_rec;

//string convenience functions

//checks whether a string consists entirely of white space
int empty(const char *s) {
    int i;
    i = 0;
    while (s[i] != 0) {
        if (!isspace(s[i])) return 0;
        i++;
    }
    return 1;
}

//substr allocates and returns a new string which is a substring of s from i to
//j exclusive, where i < j; If i or j are negative they refer to distance from
//the end of the s
char *substr(const char *s, int i, int j) {
    char *ret;
    if (i < 0) i = strlen(s)-i;
    if (j < 0) j = strlen(s)-j;
    ret = malloc(j-i+1);
    strncpy(ret,s,j-i);
    return ret;
}

//strips white space from either end of the string
void strip(char **s) {
    int i, j, len;
    char *tmp = *s;
    len = strlen(*s);
    i = 0;
    while ((isspace(*(*s+i)))&&(i < len)) {
        i++;
    }
    j = strlen(*s)-1;
    while ((isspace(*(*s+j)))&&(j > 0)) {
        j--;
    }
    *s = strndup(*s+i, j-i);
    free(tmp);
}


int main(int argc, char**argv) {
    sequence_rec *sequences = NULL;
    FILE *f = NULL;
    char *line = NULL;
    size_t linelen;
    int rcount;
    int numsequences = 0;

    f = fopen(argv[1], "r");
    if (f == NULL) {
        fprintf(stderr, "Error opening %s: %s\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }
    rcount = getline(&line, &linelen, f);
    while (rcount != -1) {
        while (empty(line)) rcount = getline(&line, &linelen, f);
        if (line[0] != '>') {
            fprintf(stderr,"Sequence input not in valid fasta format\n");
            return EXIT_FAILURE;
        }

        numsequences++;
        sequences = realloc(sequences,sizeof(sequence_rec)*numsequences);
        sequences[numsequences-1].title = strdup(line+1); strip(&sequences[numsequences-1].title);
        rcount = getline(&line, &linelen, f);
        sequences[numsequences-1].sequence = malloc(1); sequences[numsequences-1].sequence[0] = 0;
        while ((!empty(line))&&(line[0] != '>')) {
            strip(&line);
            sequences[numsequences-1].sequence = realloc(sequences[numsequences-1].sequence, strlen(sequences[numsequences-1].sequence)+strlen(line)+1);
            strcat(sequences[numsequences-1].sequence,line);
            rcount = getline(&line, &linelen, f);
        }
    }
    return EXIT_SUCCESS;
}
sirlark
  • 2,187
  • 2
  • 18
  • 28
  • Thanks for all the comments about the substring routine. I've fixed it in my code. I also noticed however that the way I dealt with negative indexes was wrong. I should add the negative index, not subtract it. That being said, I also realised that I copied the substr function in error, as I don't call it in the rest of the pasted code. – sirlark Jan 23 '12 at 14:54
  • `strip()` is also buggy. It will do bad things with zero-length strings. It looks like you don't call it with such strings, but I think it would be a good thing to fix for when it's used elsewhere. – Michael Burr Jan 23 '12 at 17:27

3 Answers3

4

You should use strings that look something like this:

struct string {
    int len;
    char *ptr;
};

This prevents strncpy bugs like what it seems you saw, and allows you to do strcat and friends faster.

You should also use a doubling array for each string. This prevents too many allocations and memcpys. Something like this:

int sstrcat(struct string *a, struct string *b)
{
    int len = a->len + b->len;
    int alen = a->len;
    if (a->len < len) {
        while (a->len < len) {
            a->len *= 2;
        }
        a->ptr = realloc(a->ptr, a->len);
        if (a->ptr == NULL) {
            return ENOMEM;
        }
    }
    memcpy(&a->ptr[alen], b->ptr, b->len);
    return 0;
}

I now see you are doing bioinformatics, which means you probably need more performance than I thought. You should use strings like this instead:

struct string {
    int len;
    char ptr[0];
};

This way, when you allocate a string object, you call malloc(sizeof(struct string) + len) and avoid a second call to malloc. It's a little more work but it should help measurably, in terms of speed and also memory fragmentation.

Finally, if this isn't actually the source of error, it looks like you have some corruption. Valgrind should help you detect it if gdb fails.

leif
  • 1,987
  • 4
  • 19
  • 22
  • @lief: Memory consumption is more of an issue than speed. I don't want to allocate in doubling chunks and have the space wasted. Admittedly, this is not a problem in the fasta parser, more in the processing. – sirlark Jan 23 '12 at 15:27
  • Due to internal malloc fragmentation, you may end up using too much memory even if you don't request it. Doubling arrays are quite well trusted, but if they are too scary, please at least use something like `malloc_usable_size` to measure your fragmentation. If you do choose doubling arrays, and my second suggestion of allocating the length and string buffer together, be careful that you include the length in the size calculation, or you may end up with awful fragmentation (if you allocate 2^n + sizeof int, for example). – leif Jan 23 '12 at 16:06
  • `char ptr[0];` is invalid C. You mean `char ptr[];`, but this is probably a bad name for the element now since it's an array, not a pointer. I would call it `data` or `contents` or something along those lines. – R.. GitHub STOP HELPING ICE Jan 23 '12 at 17:37
  • And there's no reason to think allocating `2^n+sizeof int` should result in worse fragmentation than allocating `2^n`. Powers of two have no special status here. (If you're talking about allocations so large to be serviced by the vm, e.g. `mmap`, they're large enough that even wasting a whole page is a tiny waste, percentage-wise, maybe ~2%). – R.. GitHub STOP HELPING ICE Jan 23 '12 at 17:39
  • If you're going to implement your own higher-level string data type, you might want to consider evaluating the large number of existing libraries that do this. Here's a pretty nice list of some of the more well-known ones: http://www.and.org/vstr/comparison – Michael Burr Jan 23 '12 at 19:44
  • char p[0] works in every compiler I've seen. Most allocators (jemalloc, dlmalloc) have power of two-binned allocation classes so allocating 2^n+1 will result in nearly 100% fragmentation overhead for almost all n. – leif Jan 24 '12 at 03:32
3

One potential issue is here:

strncpy(ret,s,j-i);
return ret;

ret might not get a null terminator. See man strncpy:

       char *strncpy(char *dest, const char *src, size_t n);

       ...

       The strncpy() function is similar, except that at most n bytes  of  src
       are  copied.  Warning: If there is no null byte among the first n bytes
       of src, the string placed in dest will not be null terminated.

There's also a bug here:

j = strlen(*s)-1;
while ((isspace(*(*s+j)))&&(j > 0)) {

What if strlen(*s) is 0? You'll end up reading (*s)[-1].

You also don't check in strip() that the string doesn't consist entirely of spaces. If it does, you'll end up with j < i.

edit: Just noticed that your substr() function doesn't actually get called.

NPE
  • 486,780
  • 108
  • 951
  • 1,012
1

I think the memory corruption problem might be the result of how you're handling the data used in your getline() calls. Basically, line is reallocated via strndup() in the calls to strip(), so the buffer size being tracked in linelen by getline() will no longer be accurate. getline() may overrun the buffer.

while ((!empty(line))&&(line[0] != '>')) {

    strip(&line);    // <-- assigns a `strndup()` allocation to `line`

    sequences[numsequences-1].sequence = realloc(sequences[numsequences-1].sequence, strlen(sequences[numsequences-1].sequence)+strlen(line)+1);
    strcat(sequences[numsequences-1].sequence,line);

    rcount = getline(&line, &linelen, f);   // <-- the buffer `line` points to might be
                                            //      smaller than `linelen` bytes

}
Michael Burr
  • 333,147
  • 50
  • 533
  • 760
  • You can get some nice, simple, tested functions to trim strings in-place here: http://stackoverflow.com/a/2452438/12711 Using `trim()` from that link will solve this problem (and the other latent bug(s) in the `strip()` function). – Michael Burr Jan 23 '12 at 19:05
  • I fixed all the problem with strip (and substr) and still had the problem. The interaction with getline and linelen was defintitely the problem. Thanks for all the help – sirlark Jan 26 '12 at 09:40