4

I usually try hard and harder to solve myself any bugs I find in my code, but this one is totally out of any logic for me. It works really fine with whatever strings and char separators, but only with that useless printf inside the while of the function, otherwise it prints

-> Lorem

then

-> ▼

and crashes aftwerwards. Thanks in advance to anyone that could tell me what is happening.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

char **strsep_(char *str, char ch) {
    // Sub-string length
    uint8_t len = 0;
    // The number of sub-strings found means the same as the position where it will be stored in the main pointer
    // Obviously, the number tends to increase over time, and at the end of the algorithm, it means the main pointer length too
    uint8_t pos = 0;
    // Storage for any found sub-strings and one more byte as the pointer is null-terminated
    char **arr = (char**)malloc(sizeof(char **) + 1);
    while (*str) {
        printf("Erase me and it will not work! :)\n");
        if (*str == ch) {
            // The allocated memory should be one step ahead of the current usage
            arr = realloc(arr, sizeof(char **) * pos + 1);
            // Allocates enough memory in the current main pointer position and the '\0' byte
            arr[pos] = malloc(sizeof(char *) * len + 1);
            // Copies the sub-string size (based in the length number) into the previously allocated space
            memcpy(arr[pos], (str - len), len);
            // `-_("")_-k
            arr[pos][len] = '\0';
            len = 0;
            pos++;
        } else {
            len++;
        }
        *str++;
    }
    // Is not needed to reallocate additional memory if no separator character was found
    if (pos > 0) arr = realloc(arr, sizeof(char **) * pos + 1);
    // The last chunk of characters after the last separator character is properly allocated
    arr[pos] = malloc(sizeof(char *) * len + 1);
    memcpy(arr[pos], (str - len), len);
    // To prevent undefined behavior while iterating over the pointer
    arr[++pos] = NULL;

    return arr;
}

void strsep_free_(char **arr) {
    char **aux = arr;
    while (*arr) {
        free(*arr);
        *arr = NULL;
        arr++;
    }
    // One more time to fully deallocate the null-terminated pointer
    free(*arr);
    *arr = NULL;
    arr++;
    // Clearing The pointer itself 
    free(aux);
    aux = NULL;
}

int main(void) {
    char **s = strsep_("Lorem ipsum four words", ' ');
    char **i = s;
    while (*i != NULL) {
        printf("-> %s\n", *i);
        i++;
    }
    strsep_free_(s);
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 2
    In `malloc(sizeof(char **) + 1)`, what is the `+1` for? It makes no sense to allocate a single extra *byte*. – Some programmer dude Jul 09 '18 at 06:59
  • 1
    I also recommend that you read [Do I cast the result of malloc?](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc) – Some programmer dude Jul 09 '18 at 07:02
  • @Someprogrammerdude thats for the NULL byte I put at the end of the pointer before returning it, and I cast the malloc **just** for further compatibility with C++ –  Jul 09 '18 at 07:08
  • An array of pointers need no terminator. And since it's an array of pointers you need to add the size of the pointer, not a single byte. And actually you don't terminate the `arr` array anyway. – Some programmer dude Jul 09 '18 at 07:09
  • 1
    I agree, but if I do not put one more NULL byte at the end, iterating over the pointer as I do (with ++) will not work properly. (At least I did not when coding the function to free the structure) –  Jul 09 '18 at 07:20
  • Okay, if you want a terminator, remember that the size you pass to `malloc` (and `realloc`) is the size ***in bytes***, not elements. So adding `1` adds *one single byte*. On a 32-bit system a pointer is typically *4* bytes. – Some programmer dude Jul 09 '18 at 07:24
  • `// One more time to fully deallocate the null-terminated pointer` You only reach that line if `*arr==NULL`. No need to deallocate anything if you didn't allocate. – Gerhardh Jul 09 '18 at 07:24
  • @Someprogrammerdude Yeah! I meant that (4 bytes eg.)! xD –  Jul 09 '18 at 07:28
  • @Gerhardh Yeah, I agree with that one. But no matter what I do, this code only works with that damn `printf`, if I remove it, it crashes again. –  Jul 09 '18 at 07:32
  • 1
    I think it would be better to count the number of words in a string before split it. After this, you don't need to call malloc/realloc so often, you make only one system call. And it should help avoid a problem that you have now with all this memory allocation. – Nick S Jul 09 '18 at 08:20
  • 1
    `(char**)malloc(sizeof(char **) + 1);` - womp womp – M.M Jul 09 '18 at 09:41

3 Answers3

2

The probable reason for the crash is most likely this: realloc(arr, sizeof(char **) * pos + 1).

That is the same as realloc(arr, (sizeof(char **) * pos) + 1) which does not allocate enough space for your "array". You need to do realloc(arr, sizeof(char **) * (pos + 1)).

Same with the allocation for arr[pos], you need to use parentheses correctly there too.

Some programmer dude
  • 400,186
  • 35
  • 402
  • 621
  • Well, it is even more absurd, **how this could work fine just putting a `printf` on it? So bizarre!** And, my logic can appear a bit strange but I think it is correct, I do not want to multiply the `pos + 1` by the `sizeof(char **)`, but I actually want just one more byte at the end of the pointer, on every line I allocate memory. The `pos` variable is actually the number of bytes needed for the the char array. –  Jul 09 '18 at 07:25
  • 3
    @PedroCosta Because [*undefined behavior*](https://en.wikipedia.org/wiki/Undefined_behavior)! Once you have it (which you will have) then all speculation about behavior becomes moot. – Some programmer dude Jul 09 '18 at 07:35
  • @PedroCosta You cannot access a single byte at the end of the pointers with a `char**`. This is not a string which is nul-terminated, but these are pointers. You also assign a full pointer to that location: `arr[++pos] = NULL;` Not just a single `\0` byte. – Gerhardh Jul 09 '18 at 08:51
2

Your program has undefined behavior, which means it may behave in unexpected ways, but could by chance behave as expected. Adding the extra printf changes the behavior in a way the seems to correct the bug, but only by coincidence. On a different machine, or even on the same machine at a different time, the behavior may again change.

There are multiple bugs in your program that lead to undefined behavior:

  • You are not allocating the array with the proper size: it should have space fpr pos + 1 pointers, hence sizeof(char **) * (pos + 1). The faulty statements are: char **arr = (char**)malloc(sizeof(char **) + 1); and arr = realloc(arr, sizeof(char **) * pos + 1);.

  • Furthermore, the space allocated for each substring is incorrect too: arr[pos] = malloc(sizeof(char *) * len + 1); should read arr[pos] = malloc(sizeof(char) * len + 1);, which by definition is arr[pos] = malloc(len + 1);. This does not lead to undefined behavior, you just allocate too much memory. If your system supports it, allocation and copy can be combined in one call to strndup(str - len, len).

  • You never check for memory allocation failure, causing undefined behavior in case of memory allocation failure.

  • Using uint8_t for len and pos is risky: what if the number of substrings exceeds 255? pos and len would silently wrap back to 0, producing unexpected results and memory leaks. There is no advantage at using such a small type, use int or size_t instead.

Here is a corrected version:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char **strsep_(const char *str, char ch) {
    // Sub-string length
    int len = 0;
    // The number of sub-strings found, index where to store the NULL at the end of the array.
    int pos = 0;
    // return value: array of pointers to substrings with an extra slot for a NULL terminator.
    char **arr = (char**)malloc(sizeof(*arr) * (pos + 1));
    if (arr == NULL)
        return NULL;
    for (;;) {
        if (*str == ch || *str == '\0') {
            // alocate the substring and reallocate the array
            char *p = malloc(len + 1);
            char **new_arr = realloc(arr, sizeof(*arr) * (pos + 2));
            if (new_arr == NULL || p == NULL) {
                // allocation failure: free the memory allocated so far
                free(p);
                if (new_arr)
                    arr = new_arr;
                while (pos-- > 0)
                    free(arr[pos]);
                free(arr);
                return NULL;
            }
            arr = new_arr;
            memcpy(p, str - len, len);
            p[len] = '\0';
            arr[pos] = p;
            pos++;
            len = 0;
            if (*str == '\0')
                break;
        } else {
            len++;
        }
        str++;
    }
    arr[pos] = NULL;
    return arr;
}

void strsep_free_(char **arr) {
    int i;
    // Free the array elements 
    for (i = 0; arr[i] != NULL; i++) {
        free(arr[i]);
        arr[i] = NULL;  // extra safety, not really needed
    }
    // Free The array itself 
    free(arr);
}

int main(void) {
    char **s = strsep_("Lorem ipsum four words", ' ');
    int i;
    for (i = 0; s[i] != NULL; i++) {
        printf("-> %s\n", s[i]);
    }
    strsep_free_(s);
    return 0;
}

Output:

-> Lorem
-> ipsum
-> four
-> words
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • Holy Jesus! You rocked on your answer. **Really** thanks for your detailed explanation, I've never suffered from such kind of bug so I freaked out a bit about it. –  Jul 09 '18 at 20:34
0

Good answer from @chqrlie. From my side, I think it would be better to count everything before copy, it should help to avoid realloc.

#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int count_chars(const char *str, const char ch)
{
    int i;
    int count;

    i = 0;
    count = 0;
    if (*str == ch)
        str++;

    while (str[i] != ch && str[i] != '\0')
    {
        count++;
        i++;
    }

    return (count);
}

int count_delimeter(const char *str, const char ch)
{
    int i = 0;
    int count = 0;

    while (str[i])
    {
        if (str[i] == ch && str[i + 1] != ch)
            count++;
        i++;
    }

    return count;
}

char** strsep_(const char *str, const char ch)
{
    char **arr;
    int index = 0;
    int size = 0;
    int i = 0;

    size = count_delimeter(str, ch) + 1;

    if ((arr = malloc(sizeof(char *) * (size + 1))) == NULL)
        return (NULL);
    arr[size] = NULL;

    while (i < size)
    {
        if (str[index] == ch)
            index++;

        if (str[index] && str[index] == ch && str[index + 1] == ch)
        {
            while (str[index] && str[index] == ch && str[index + 1] == ch)
                index++;
            index++;
        }

        int len = count_chars(&str[index], ch);
        if ((arr[i] = malloc(sizeof(char) * (len + 1))) == NULL)
            return NULL;

        memcpy(arr[i], &str[index], len);
        index += len;
        arr[i++][len] = '\0';
    }

    return arr;
}

int main(void)
{
    char *str = "Lorem   ipsum  ipsum Lorem lipsum gorem insum";
    char **s = strsep_(str, ' ');
    /* char *str = "Lorem + Ipsum"; */
    /* char **s = strsep_(str, '+'); */
    /* char *str = "lorem, torem, horem, lorem"; */
    /* char **s = strsep_(str, ','); */
    while (*s != NULL) {
        printf("-> [%s]\n", *s);
        s++;
    }

    /* dont forget to free */
    return 0;
}
Nick S
  • 1,299
  • 1
  • 11
  • 23
  • Counting the separators in an initial pass to avoid `realloc` is a good idea. Why do you consider sequences of separators to represent a single separator? `strtok()` as similar semantics, but Javascript's `Array.prototype.split()`, which is closer to the OP's function does not. – chqrlie Jul 09 '18 at 17:31
  • Nice idea too, Nick! That would make the function ever harder to crash with bad input while running. I'm actually coding my own HTTP parser, so that function will be used on every request on the top of a socket. I decided to use pointer to pointer and multiple reallocs because RAM will be a major constraint in my project, so doing that way could save precious bytes. Although, realloc'ing that often can drain processing time too. Sometimes, it is hard to choose which resource you will sacrifice! –  Jul 10 '18 at 02:01