Splitting strings in language C

Question

I was wondering how we would go about splitting a string into tokens, or whether there are other efficient ways of doing this.

For example, I have:

char string1[] = "hello\tfriend\n";

How would I get "hello" and "friend" in their own separate variables?

possible duplicate of [Split string with delimiters in C](http://stackoverflow.com/questions/9210528/split-string-with-delimiters-in-c) — Norbert, Apr 13 '15 at 01:50
There are 2 primary ways: (1) with `strtok` or (2) simple pointer use to identify the beginning of each word and the separator character following the word. — David C. Rankin, Apr 13 '15 at 02:00

David C. Rankin · Accepted Answer · 2015-04-14T05:19:31.830

Here is a very simple example that splits your string into parts saved in an array of character arrays, using a start pointer and an end pointer. The MAXL and MAXW defines are simply a convenient way to declare constants that limit each word to 32 bytes (31 chars + null terminator) and the original string to a maximum of 3 words (parts):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXL 32
#define MAXW 3

int main (void) {

    char string1[] = "hello\tfriend\n";
    char *sp = string1;                 /* start pointer        */
    char *ep = string1;                 /* end pointer          */
    unsigned c = 0;                     /* temp character       */
    unsigned idx = 0;                   /* index for part       */
    char strings[MAXW][MAXL] = {{0}};   /* array to hold parts  */

    while (*ep)                         /* for each char in string1 */
    {
        if (*ep == '\t' || *ep == '\n') /* test if \t or \n         */
        {
            c = *ep;                    /* save character           */
            *ep = 0;                    /* replace with null-terminator */
            strcpy (strings[idx], sp);  /* copy part to strings array   */
            *ep = c;                    /* replace w/original character */
            idx++;                      /* increment index          */
            sp = ep + 1;                /* set start pointer        */
        }
        ep++;                           /* advance to next char     */
    }

    printf ("\nOriginal string1 : %s\n", string1);

    unsigned i = 0;
    for (i = 0; i < idx; i++)
        printf ("  strings[%u] : %s\n", i, strings[i]);

    return 0;
}

Output

$ ./bin/split_hello

Original string1 : hello        friend

  strings[0] : hello
  strings[1] : friend

Using strtok simply replaces the manual pointer logic with a function call that splits the string.
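For comparison, here is a minimal sketch of the strtok approach. The helper name `split_strtok`, the delimiter argument, and the size of the working buffer are my assumptions for illustration, not part of the code above. Because strtok writes null characters into the string it scans, the sketch works on a copy and leaves the caller's string untouched:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXL 32
#define MAXW 3

/* Hypothetical helper: split src on any character found in delims and
 * copy at most MAXW tokens into strings. Returns the number of tokens. */
unsigned split_strtok (const char *src, const char *delims,
                       char strings[MAXW][MAXL])
{
    char buf[MAXW * MAXL];      /* working copy (size is an assumption) */
    unsigned idx = 0;

    assert (src && delims);

    /* copy src, truncating if it is longer than buf */
    snprintf (buf, sizeof buf, "%s", src);

    for (char *tok = strtok (buf, delims);
         tok && idx < MAXW;
         tok = strtok (NULL, delims))
    {
        /* copy token, truncating to MAXL - 1 chars + null terminator */
        snprintf (strings[idx], MAXL, "%s", tok);
        idx++;
    }

    return idx;
}
```

Calling `split_strtok (string1, "\t\n", strings)` with the string from the question returns 2 and stores "hello" and "friend" in `strings[0]` and `strings[1]`. Note that strtok treats a run of consecutive delimiters as a single separator, so empty fields are never returned.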

Updated Line-end Handling Example

As you have found, when stepping through the string you can make the example as simple as the current string requires, but with a little extra effort you can extend the code to handle a broader range of situations. In your comment you noted that the code above does not handle a string with no newline at the end. Rather than changing the code to handle just that case, with a bit of thought you can improve it so that it handles both. One approach would be:

    while (*ep)                         /* for each char in string1 */
    {
        if (*ep == '\t' || *ep == '\n') /* test if \t or \n         */
        {
            c = *ep;                    /* save character           */
            *ep = 0;                    /* replace with null-terminator */
            strcpy (strings[idx], sp);  /* copy part to strings array   */
            *ep = c;                    /* replace w/original character */
            idx++;                      /* increment index          */
            sp = ep + 1;                /* set start pointer        */
        }
        else if (!*(ep + 1))  {         /* check if next is ending  */
            strcpy (strings[idx], sp);  /* handle no ending '\n'    */
            idx++;
        }

        ep++;                           /* advance to next char     */
    }

Break on Any Format/Non-Print Character

To broaden further the set of characters that can separate the words, rather than testing for discrete values you can use a range of ASCII values and treat every non-printing or format character as a separator. A slightly different approach can be used:

    char string1[] = "\n\nhello\t\tmy\tfriend\tagain\n\n";
    char *p = string1;                  /* pointer to char      */
    unsigned idx = 0;                   /* index for part       */
    unsigned i = 0;                     /* generic counter      */
    char strings[MAXW][MAXL] = {{0}};   /* array to hold parts  */

    while (*p)                          /* for each char in string1 */
    {
        if (idx == MAXW) {              /* test MAXW not exceeded   */
            fprintf (stderr, "error: MAXW (%d) words in string exceeded.\n", MAXW);
            break;
        }

        /* skip each non-print/format char */
        while (*p && (*p < ' ' || *p > '~'))
            p++;

        if (!*p) break;                 /* if end of string, break  */

        while (*p >= ' ' && *p <= '~')  /* for each printable char  */
        {
            strings[idx][i] = *p++;     /* copy to strings array    */
            i++;                        /* advance to next position */
        }

        strings[idx][i] = 0;            /* null-terminate strings   */
        idx++;                          /* next index in strings    */
        i = 0;                          /* start at beginning char  */
    }

This will handle your test string regardless of line ending and regardless of the number of tabs or newlines included. Take a look at ASCII Table and Description as a reference for the character ranges used.
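As a variation on the loop above, here is a rough sketch that uses the standard `isprint()` from `<ctype.h>` instead of the hard-coded `' '` to `'~'` range, and that also stops a long word from overrunning its row. The function name `split_printable` and the choice to silently drop characters beyond MAXL - 1 are my assumptions, not part of the code above. In the default "C" locale `isprint()` accepts the same range of characters:

```c
#include <assert.h>
#include <ctype.h>

#define MAXL 32
#define MAXW 3

/* Hypothetical helper: treat every non-printing character as a separator
 * and copy at most MAXW words into strings. Returns the number of words. */
unsigned split_printable (const char *s, char strings[MAXW][MAXL])
{
    unsigned idx = 0;

    assert (s);

    while (*s && idx < MAXW) {
        /* skip each non-print/format char */
        while (*s && !isprint ((unsigned char)*s))
            s++;

        if (!*s) break;                     /* end of string reached  */

        unsigned i = 0;
        while (isprint ((unsigned char)*s)) {
            if (i < MAXL - 1)               /* drop chars that do not fit */
                strings[idx][i++] = *s;
            s++;
        }

        strings[idx][i] = 0;                /* null-terminate the word */
        idx++;
    }

    return idx;
}
```

The cast to `unsigned char` matters because passing a negative `char` value to the `<ctype.h>` functions is undefined behavior. With MAXW at 3, the four-word test string above yields the first three words and the rest is ignored.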

Note that you cannot reliably use `strtok()` on string literals as it modifies the string it analyzes, and string literals are often not modifiable. — Jonathan Leffler, Apr 13 '15 at 03:15
Well, yes. You would also be expected to make a copy of the string first to prevent leaving your original littered with null-terminating characters. Good point. — David C. Rankin, Apr 13 '15 at 03:32
Hi, I tried out your suggestion and I was wondering, what if the text was just "hello\tfriend". When I tried to do strings[1] it wouldn't come out. Thanks! — Antwon, Apr 13 '15 at 07:27
This is a fairly specific example, but to cover the case where there is no newline, after `ep++;` add `if (!ep) strcpy (strings[idx], sp);` and try again. Pointers are simple, you just have to think about where they are pointing. When learning, it helps to write the string out on paper and follow along each loop by hand. All becomes clear.`:p` — David C. Rankin, Apr 13 '15 at 20:58
Uh, that should be `if (!*ep) strcpy (strings[idx], sp);` (the dereference is important to check whether the ending null-terminator has been reached) — David C. Rankin, Apr 13 '15 at 21:12
Thank you so much @DavidC.Rankin. I was wondering, how would you go about deleting one of the entries and shifting the rest over? For example: if I deleted strings[0], what's left is strings[1] (friend). We would then shift strings[1] to index 0, and strings[1] would now be NULL? I'm trying to figure this out, but I haven't got an idea. Thanks! — Antwon, Apr 14 '15 at 04:54
Glad I could help. For the sake of completeness, I went ahead and updated the last example to include checks for exceeding the maximum number of words and included code to consider multiple separators as a single separator (e.g. `hello\t\t\tfriend` would be treated as `hello\tfriend`). Your next challenge is to move this code to a function that returns `char **` and dynamically allocate the pointer array, so that all you need do in `main()` is `char **strings = splitstr (string1, &idx);` to fill the `strings` array and return the number of words in `idx` `:p` — David C. Rankin, Apr 14 '15 at 05:13
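For anyone attempting that challenge, here is one possible sketch of `splitstr()`, not a definitive version. Only the call shape `splitstr (string1, &idx)` comes from the comment above. The fixed `"\t\n"` delimiter set, the use of `strspn()`/`strcspn()` to find word boundaries, and returning NULL for both an empty result and an allocation failure are my assumptions:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: split s on tabs and newlines into a dynamically allocated
 * array of dynamically allocated words. Stores the word count in *n.
 * Returns NULL (with *n == 0) if there are no words or allocation fails. */
char **splitstr (const char *s, unsigned *n)
{
    const char *delims = "\t\n";    /* assumed separator set */
    char **words = NULL;
    unsigned count = 0;

    assert (s && n);
    *n = 0;

    while (*s) {
        s += strspn (s, delims);            /* skip a run of separators */
        size_t len = strcspn (s, delims);   /* length of the next word  */
        if (!len) break;                    /* nothing but separators   */

        /* grow the pointer array by one slot */
        char **tmp = realloc (words, (count + 1) * sizeof *words);
        if (!tmp) goto fail;
        words = tmp;

        words[count] = malloc (len + 1);    /* +1 for null terminator   */
        if (!words[count]) goto fail;
        memcpy (words[count], s, len);
        words[count][len] = 0;

        count++;
        s += len;
    }

    *n = count;
    return words;

fail:                                       /* release what was built   */
    while (count)
        free (words[--count]);
    free (words);
    return NULL;
}
```

The caller owns the result and should `free()` each word and then the array itself. Since the input is only read, never modified, this version can also be called on a string literal.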
