split string into words in C

Question

I have just coded splitting string into words.

if char *cmd = "Hello world baby", then argv[0] = "Hello", argv[1] = "world", argv[2] = "baby".

strdup function cannot be used, and I want to implement this using malloc and strcpy.

my code is below.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define buf_size 128

int main() {
    
    char *argv[16];
    memset(argv, 0, sizeof(argv));
    int words = 0;

    char *cmd = "Hello world baby";
    unsigned int len = strlen(cmd);
    int start = 0;
    for(unsigned int i = 0; i <= len; i++){
        if(cmd[i] == ' ' | cmd[i] == '\0'){
            ++words;
            char *w = (char *)malloc(sizeof(char)*(i-start) + 1);
            strcpy(w, cmd + start);
            w[i-start] = '\0';
            argv[i] = w;
            start = i + 1;
        }
    }

    for(int i = 0; i < words; i++){
    printf("%s\n", argv[i]);
    free(argv[i]);
    }
    return 0;
}

I hoped that the printf function produces:

Hello
world
baby

However, when the printf() function is reached, the program triggers a segmentation fault.

Can't use `strdup`? Write your own: `char * my_strdup(const char *src) { size_t len = strlen(src); char *buf = malloc(len + 1); strcpy(buf,src); return buf; }` That's legal, based on your assignment. And, recommended. Suppose you had to do this in _multiple_ places. Everybody would suggest creating a function to eliminate code replication (i.e. DRY--don't repeat yourself). — Craig Estey, Nov 02 '22 at 18:24
You must use a different index for the `argv`. For instance, `int j = 0;` and replace `argv[i] = w` with `argv[j++] = w` after checking for `j < 16`. Also replace the `|` with `||`. — M. Nejat Aydin, Nov 02 '22 at 18:28
What @CraigEstey said except that the code ***must*** check that the allocation worked — `char * my_strdup(const char *src) { size_t len = strlen(src); char *buf = malloc(len + 1); if (buf != NULL) strcpy(buf,src); return buf; }` — you can choose how to write the test, but you must test the result of `malloc()` before using it. There's a chance that `memmove(buf, src, len + 1);` will be quicker than `strcpy()` — it doesn't have to test each byte as it progresses. And you can use `memcpy()` instead of `memmove()` here, but I use `memmove()` because it is always correct and `memcpy()` isn't. — Jonathan Leffler, Nov 02 '22 at 18:51
@JonathanLeffler I realized that [at 6 minutes]. I was too lazy to fix it with another comment. But, then I thought "an exercise". But, OP didn't respond. It's slightly faster to use `memcpy`, so, hopefully, I acquit myself with: `char * my_strdup(const char *src) { size_t len = strlen(src) + 1; char *buf = malloc(len); if (buf != NULL) memcpy(buf,src,len); return buf; }` — Craig Estey, Nov 02 '22 at 18:58
@CraigEstey — that saves a `+ 1` that I didn't optimize out. Let's call it quits (even-stevens). — Jonathan Leffler, Nov 02 '22 at 19:06

Jonathan Leffler · Accepted Answer · 2022-11-02T19:42:15.250

Your primary problem, despite the banter in the comments about how to write your own version of strdup(), is that you really need strndup(). Eh?

You have the line:

strcpy(w, cmd + start);

Unfortunately, this copies the whole string from cmd + start into the allocated space, but you only wanted to copy (i - start) + 1 bytes including the null byte, because that's all the space you allocated. So, you have a buffer overflow (but not a stack overflow).

POSIX provides the function strndup() with the signature:

extern char *strndup(const char *s, size_t size);

This allocates at most size + 1 bytes and copies at most size bytes from s and a null byte into the allocated space. You'd use:

argv[i] = strndup(cmd + start, i - start);

to get the required result. If you don't have (or can't use) strndup(), you can write your own. That's easiest if you have strnlen(), but you can write your own version of that if necessary (you don't have it or can't use it):

char *my_strndup(const char *s, size_t len)
{
    size_t nbytes = strnlen(s, len);
    char *result = malloc(nbytes + 1);
    if (result != NULL)
    {
        memmove(result, s, nbytes);
        result[nbytes] = '\0';
    }
    return result;
}

This deals with the situation where the actual string is shorter than the maximum length it can be by using the size from strnlen(). It's not clear that you are guaranteed to be able to access the memory at s + nbytes - 1, so simply allocating for the maximum size is not appropriate.

Implementing strnlen():

size_t my_strnlen(const char *s, size_t size)
{
    size_t count = 0;
    while (count < size && *s++ != '\0')
        count++;
    return count;
}

"Official" versions of this are probably implemented in assembler and are more efficient, but I think that's a valid implementation in pure C.

Another alternative in your code is to use the knowledge of the length:

            char *w = (char *)malloc(sizeof(char)*(i - start + 1));
            memmove(w, cmd + start, i - start);
            w[i-start] = '\0';
            argv[i] = w;
            start = i + 1;

I note in passing that multiplying by sizeof(char) is a no-op since sizeof(char) == 1 by definition. You should include the + 1 in the multiplication in general (as I've reparenthesized the expression). If you were dealing with some structure and wanted N + 1 structures, you need to use (N + 1) * sizeof(struct WhatNot) and not N * sizeof(struct WhatNot) + 1. It's a good idea to head off bugs caused by sloppy coding practices while you're learning, even though there's no difference in the result here.

There are those who excoriate the cast on the result of malloc(). I'm not one of them: I learned to program on a pre-standard C system where the cast was crucial because the char * address of an object was different from the address of the same memory location when referenced via a pointer to a type bigger than a char. That is, the short * address and char * address for the same memory location had different bit patterns. Not casting the result of malloc() led to crashes. (I observe that the primary excuse given for rejecting the cast is "it may hide errors if malloc() is not declared". That excuse went by the wayside when C99 mandated that functions must be declared before being used.)

^{Warning: no compiler was consulted about the validity of any of the code shown in this answer. Nor was the sanity of the overall algorithm tested.}

You have:

if(cmd[i] == ' ' | cmd[i] == '\0'){

That | should be ||.

split string into words in C

1 Answers1