Split a string based on contiguous delimiters

Question

I'm looking to split a string based on a specific sequence of characters, but only if they appear in that order.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
  int i = 0;
  char **split;
  char *tmp;

  split = malloc(20 * sizeof(char *));
  tmp  = malloc(20 * 12 * sizeof(char));
  for(i=0;i<20;i++)
  {
    split[i] = &tmp[12*i];
  }

  char *line;
  line = malloc(50 * sizeof(char));

  strcpy(line, "Test - Number -> <10.0>");
  printf("%s\n", line);
  i = 0;

  while( (split[i] = strsep(&line, " ->")) != NULL)
  {
    printf("%s\n", split[i]);
    i++;
  }
}

This will print out:

Test 
Number
<10.0

However I just want to split around the -> so it could give the output:

Test - Number
<10.0>

Possible duplicate of [What are the differences between strtok and strsep in C](https://stackoverflow.com/questions/7218625/what-are-the-differences-between-strtok-and-strsep-in-c) — lost_in_the_source, Feb 11 '18 at 18:19
You can use `strstr` to find a substring and use it for calculating the splits. — Pablo, Feb 11 '18 at 18:23
@WeatherVane His input is currently hardcoded: `strcpy(line, "Test - Number -> <10.0>");` — Steve Summit, Feb 11 '18 at 18:24
Both `strtok` and `strsep` interpret their delimiter string as "any of these characters may delimit", so neither of those functions will come close to doing what you want. — Steve Summit, Feb 11 '18 at 18:25
@Pablo I deleted a similar comment because the input might be `"<11.0>"` and not `"<10.0>"`. That only works on the hard coded input and fixed expectation. — Weather Vane, Feb 11 '18 at 18:27
You can find the first `'<'` with `strchr` and then extract the value with `sscanf` which will stop at the `'>'`. Alternatively, you don't have to use the same delimiter set with `strtok` and `strsep` each time. You can use `"<"` the first time and `">"` the second time (and restructure the loop). — Weather Vane, Feb 11 '18 at 18:32
@WeatherVane I think you've completely misunderstood his question. — Steve Summit, Feb 11 '18 at 18:46
@WeatherVane I still think `strstr` is the best way to do that, by replicating what `strtok_r` does but using `strstr` instead of `strchr` (for every char of the delim). — Pablo, Feb 11 '18 at 18:47
@stackptr I don't think this is a duplicate of that question. The OP's is just using the wrong function. — Pablo, Feb 11 '18 at 18:49

Pablo · Accepted Answer · 2018-02-11T23:22:52.150

I think the best way to split on an ordered sequence of delimiters is to replicate the behaviour of strtok_r using strstr, like this:

#include <stdio.h>
#include <string.h>

char *substrtok_r(char *str, const char *substrdelim, char **saveptr)
{
    char *haystack;

    if(str)
        haystack = str;
    else
        haystack = *saveptr;

    char *found = strstr(haystack, substrdelim);

    if(found == NULL)
    {
        *saveptr = haystack + strlen(haystack);
        return *haystack ? haystack : NULL;
    }

    *found = 0;
    *saveptr = found + strlen(substrdelim);

    return haystack;
}


int main(void)
{
    char line[] = "a -> b -> c -> d; Test - Number -> <10.0> ->No->split->here";

    char *input = line;
    char *token;
    char *save;

    while((token = substrtok_r(input, " ->", &save)) != NULL)
    {
        input = NULL;
        printf("token: '%s'\n", token);
    }

    return 0;
}

This behaves like strtok_r but only splits when the substring is found. The output of this is:

$ ./a 
token: 'a'
token: ' b'
token: ' c'
token: ' d; Test - Number'
token: ' <10.0>'
token: 'No->split->here'

And like strtok and strtok_r, it requires the source string to be modifiable, because it writes '\0'-terminating bytes into it to create the tokens it returns.
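
If the source cannot be modified (a string literal, for example), one option is to report each token as a pointer plus a length instead of writing '\0' into the text. The following is only a sketch of that idea; substr_span and print_tokens are hypothetical names that are not part of the answer's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: describes the token starting at `start` as a
 * (pointer, length) pair without writing to the string. Stores the token
 * length in *len and returns the position to resume from, or NULL when
 * this was the last token. */
const char *substr_span(const char *start, const char *delim, size_t *len)
{
    assert(start != NULL && delim != NULL && len != NULL);
    assert(delim[0] != '\0');   /* an empty delimiter would match everywhere */

    const char *found = strstr(start, delim);
    if (found == NULL) {
        *len = strlen(start);   /* the rest of the string is the last token */
        return NULL;
    }

    *len = (size_t)(found - start);
    return found + strlen(delim);
}

/* Hypothetical usage: prints every token using the "%.*s" precision trick. */
void print_tokens(const char *text, const char *delim)
{
    size_t len;
    while (text != NULL) {
        const char *next = substr_span(text, delim, &len);
        printf("token: '%.*s'\n", (int)len, text);
        text = next;
    }
}
```

Calling print_tokens("Test - Number -> <10.0>", " ->") prints token: 'Test - Number' and then token: ' <10.0>'. Unlike substrtok_r, this sketch reports an empty final token when the text ends with the delimiter.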

EDIT

Hi, would you mind explaining why '*found = 0' means the return value is only the string in-between delimiters. I don't really understand what is going on here or why it works. Thanks

The first thing you've got to understand is how strings work in C. A string is just a sequence of bytes (characters) that ends with the '\0'-terminating byte. I wrote bytes and characters in parentheses because a char in C is a 1-byte value (on most systems a byte is 8 bits long), and on most systems the integer values representing the characters are those defined in the ASCII code table, which are 7-bit values. In that table the value 97 represents the character 'a', 98 represents 'b', etc. Writing

char x = 'a';

is the same (on an ASCII system) as doing

char x = 97;

The value 0 is a special value for strings; it is called NUL (the null character) or the '\0'-terminating byte. This value tells functions where a string ends. A function like strlen, which returns the length of a string, does so by counting the bytes it encounters until it reaches a byte with the value 0.

That's why strings are stored in char arrays: a pointer to the array gives you the start of the memory block where the sequence of chars is stored.
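
The counting described above can be sketched as follows. my_strlen is a hypothetical name used only for illustration; real code should simply call strlen.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative re-implementation of the counting strlen performs:
 * walk forward from the start pointer until a byte with value 0 is found. */
size_t my_strlen(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

Because the function only ever sees a start pointer, whatever is stored in the array after the first 0 byte is invisible to it.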

Let's look at this:

char string[] = { 'H', 'e', 'l', 'l', 'o', 0, 48, 49, 50, 0 };

The memory layout for this array would be

0     1     2     3     4     5    6     7     8     9
+-----+-----+-----+-----+-----+----+-----+-----+-----+----+
| 'H' | 'e' | 'l' | 'l' | 'o' | \0 | '0' | '1' | '2' | \0 |
+-----+-----+-----+-----+-----+----+-----+-----+-----+----+

or to be more precise with the integer values

0    1     2     3     4     5   6    7    8    9
+----+-----+-----+-----+-----+---+----+----+----+---+
| 72 | 101 | 108 | 108 | 111 | 0 | 48 | 49 | 50 | 0 |
+----+-----+-----+-----+-----+---+----+----+----+---+

Note that the value 0 represents '\0', 48 represents '0', 49 represents '1' and 50 represents '2'. If you do

printf("%zu\n", strlen(string));

the output will be 5. strlen finds the value 0 at index 5 and stops counting. However, string stores two strings: from index 6 on, a new sequence of characters starts that also terminates with 0, making it a second valid string in the array. To access it, you need a pointer that points past the first 0 value.

printf("1. %s\n", string);
printf("2. %s\n", string + strlen(string) + 1);

The output would be

1. Hello
2. 012

This property is used in functions like strtok (and mine above) to return you a substring from a larger string, without the need of creating a copy (that would be creating a new array, dynamically allocating memory, using strcpy to create the copy).

Assume you have this string:

char line[] = "This is a sentence;This is another one";

Here you have only one string, because the '\0'-terminating byte comes after the last 'e' in the string. However, if I do:

line[18] = 0;  // same as line[18] = '\0';

then I created two strings in the same array:

"This is a sentence\0This is another one"

because I replaced the semicolon ';' with '\0', thus creating one string from position 0 to 18 and a second one from position 19 to 38 (terminating bytes included). If I now do

printf("string: %s\n", line);

the output will be

string: This is a sentence

Now let's take a look at the function itself:

char *substrtok_r(char *str, const char *substrdelim, char **saveptr);

The first argument is the source string, the second argument is the delimiter string, and the third is a double pointer to char: you have to pass a pointer to a pointer to char. It is used to remember where the function should resume scanning on the next call; more on that later.

This is the algorithm:

if str is not NULL:
    start a new scan sequence from str
otherwise
    resume scanning from string pointed to by *saveptr

find the position of the delimiter substring_d given by 'substrdelim'

if no such substring_d is found
    if the current character of the scanned text is \0
        no more substrings to return --> return NULL
    otherwise
        return the scanned text and set *saveptr to
        point to the \0 character of the scanned text,
        so that the next iteration ends the scanning
        by returning NULL

otherwise (a substring_d was found)

    create a new substring_a until the found one
    by setting the first character of the found
    substring_d to 0.

    update *saveptr to the start of the found substring_d
    plus its length, so that *saveptr points just
    past the delimiter sequence found in substring_d.

    return new created substring_a

This first part is easy to understand:

if(str)
    haystack = str;
else
    haystack = *saveptr;

Here, if str is not NULL, you want to start a new scan sequence. That's why in main the input pointer is set to point to the start of the string stored in line. Every later call must be made with str == NULL, which is why the first thing done in the while loop is to set input = NULL; so that substrtok_r resumes scanning from *saveptr. This is the standard calling convention of strtok.
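
For comparison, the same "string on the first call, NULL afterwards" convention with the standard strtok might look like the sketch below. count_tokens is a hypothetical helper introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok finds in `text`.
 * `text` is modified, as with any strtok-style tokenizer. */
size_t count_tokens(char *text, const char *delims)
{
    assert(text != NULL && delims != NULL);

    size_t count = 0;
    /* The first call passes the string; every later call passes NULL to resume. */
    for (char *tok = strtok(text, delims); tok != NULL; tok = strtok(NULL, delims))
        count++;
    return count;
}
```

With the question's input "Test - Number -> <10.0>" and the delimiter set " ->", this counts three tokens (Test, Number and <10.0), because strtok treats each of the three characters as a delimiter on its own rather than as one sequence.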

The next step is to look for a delimiting substring:

char *found = strstr(haystack, substrdelim);

The next part handles the case where no delimiting substring is found²:

if(found == NULL)
{
    *saveptr = haystack + strlen(haystack);
    return *haystack ? haystack : NULL;
}

*saveptr is updated to point past the remaining text, so that it points to its '\0'-terminating byte. The return line can be rewritten as

if(*haystack == '\0')
    return NULL;
else
    return haystack;

which says: if the source is already an empty string¹, return NULL. This means no more substrings were found and the caller should stop calling the function. This is also the standard behaviour of strtok.

The last part

*found = 0;
*saveptr = found + strlen(substrdelim);

return haystack;

handles the case where a delimiting substring is found. Here

*found = 0;

is basically doing

found[0] = '\0';

which creates substrings as explained above. To make it clear once again:

Before

*found = 0;
*saveptr = found + strlen(substrdelim);

return haystack;

the memory looks like this:

       +-----+-----+-----+-----+-----+-----+
       | 'a' | ' ' | '-' | '>' | ' ' | 'b' | ...
       +-----+-----+-----+-----+-----+-----+
       ^     ^
       |     |
haystack     found
*saveptr

After

*found = 0;
*saveptr = found + strlen(substrdelim);

the memory looks like this:

       +-----+------+-----+-----+-----+-----+
       | 'a' | '\0' | '-' | '>' | ' ' | 'b' | ...
       +-----+------+-----+-----+-----+-----+
       ^     ^                  ^
       |     |                  |
haystack     found              *saveptr
                                because strlen(substrdelim)
                                is 3

Remember: if I do printf("%s\n", haystack); at this point, it will print a, because the space that found points to has been set to 0. *found = 0 created two strings out of one, as explained above. strtok (and my function, which is based on strtok) uses the same technique. So when the function does

return haystack;

the string returned in token is the text before the split. Eventually substrtok_r returns NULL and the loop exits, because substrtok_r returns NULL when no more splits can be made, just like strtok.
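
To tie this back to the question, which stores its tokens in an array, a caller could collect the returned pointers as in the sketch below. split_all is a hypothetical helper, and the delimiter " -> " used in the note that follows (with a trailing space) is an assumption; the answer's own example uses " ->".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same logic as substrtok_r from the answer, repeated so the sketch compiles alone. */
char *substrtok_r(char *str, const char *substrdelim, char **saveptr)
{
    char *haystack = str ? str : *saveptr;
    char *found = strstr(haystack, substrdelim);

    if (found == NULL) {
        *saveptr = haystack + strlen(haystack);
        return *haystack ? haystack : NULL;
    }

    *found = 0;
    *saveptr = found + strlen(substrdelim);
    return haystack;
}

/* Hypothetical helper: stores up to `max` token pointers in `out` and returns
 * how many were stored. The pointers refer into `str`, which is modified. */
size_t split_all(char *str, const char *delim, char **out, size_t max)
{
    assert(str != NULL && delim != NULL && out != NULL);
    assert(delim[0] != '\0');   /* an empty delimiter would never advance */

    size_t n = 0;
    char *save = NULL;
    char *tok = substrtok_r(str, delim, &save);
    while (tok != NULL && n < max) {
        out[n++] = tok;
        tok = substrtok_r(NULL, delim, &save);
    }
    return n;
}
```

For a modifiable char line[] = "Test - Number -> <10.0>" and the delimiter " -> ", split_all stores two pointers, to "Test - Number" and "<10.0>". No copies are made, so the tokens are only valid as long as line is.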

Footnotes

¹An empty string is a string where the first character is already the '\0'-terminating byte.

²This is a very important part. Most of the standard C library functions like strstr will not return a new string in memory; they do not create a copy and return it (unless the documentation says so). They return a pointer into the original, that is, the original pointer plus an offset.

On success strstr returns a pointer to the start of the substring; this pointer is at an offset from the source.

const char *txt = "abcdef";
char *p = strstr(txt, "cd");

Here strstr will return a pointer to the start of the substring "cd" in "abcdef". To get the offset you compute p - txt, which gives the number of bytes between the two pointers.

b = base address where txt is pointing to

b     b+1   b+2   b+3   b+4   b+5   b+6
+-----+-----+-----+-----+-----+-----+------+
| 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | '\0' |
+-----+-----+-----+-----+-----+-----+------+
^           ^
|           |
txt         p

So txt points to address b and p points to address b+2. That's why you get the offset by doing p - txt, which is (b+2) - b => 2. So p points to the original address plus an offset of 2 bytes. This behaviour is what makes things like *found = 0; work in the first place.

Note that an expression like txt + 2 gives you a new pointer pointing to where txt points plus an offset of 2. This is called pointer arithmetic. It's like regular arithmetic, but here the compiler takes the size of the pointed-to object into consideration. char is a type that is defined to have size 1, hence sizeof(char) is 1. But let's say you have an array of integers:
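
The offset calculation can be wrapped up as in this sketch. substr_offset is a hypothetical helper name; ptrdiff_t is the standard type of the difference between two pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset in bytes of the first occurrence of `needle`
 * inside `text`, or -1 if `needle` does not occur. */
ptrdiff_t substr_offset(const char *text, const char *needle)
{
    assert(text != NULL && needle != NULL);

    const char *p = strstr(text, needle);
    return p ? p - text : -1;   /* p - text is (b+2) - b in the diagram above */
}
```

substr_offset("abcdef", "cd") gives 2, matching the diagram, and a needle that is not present gives -1.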

int arr[] = { 7, 2, 1, 5 };

On my system an int has size of 4, so an int object needs 4 bytes in memory. This array looks like this in memory:

base = base address where arr is stored

address       base        base + 4    base + 8    base + 12
in bytes      +-----------+-----------+-----------+-----------+
              |    7      |    2      |    1      |    5      |
              +-----------+-----------+-----------+-----------+
pointer       arr         arr + 1     arr + 2     arr + 3
arithmetic

Here arr + 1 returns you a pointer pointing to where arr is stored plus an offset of 4 bytes.
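
A small sketch can make the scaling visible by measuring the distance between two elements in bytes. byte_distance is a hypothetical helper, and it relies on sizeof(int) rather than assuming the value 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: distance in bytes between two elements of the same
 * int array, obtained by viewing both addresses as pointers to char. */
size_t byte_distance(const int *from, const int *to)
{
    assert(from != NULL && to != NULL && to >= from);
    return (size_t)((const char *)to - (const char *)from);
}
```

For the array above, byte_distance(arr, arr + 1) equals sizeof(int) and byte_distance(arr, arr + 3) equals 3 * sizeof(int), even though the pointer arithmetic itself only added 1 and 3.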

I see, looking for `" ->"` but this leaves unwanted spaces in the tokens, would `" -> "` be better? Has OP defined the input well enough? — Weather Vane, Feb 11 '18 at 19:27
@WeatherVane I wanted my function to behave like `strtok`. If you leave spaces between the delims, as in `a : b : c`, and you do `strtok(..., ":")`, then your tokens are going to be `a `, ` b `, ` c`. I wanted to emulate that behaviour. — Pablo, Feb 11 '18 at 19:46
Hi, would you mind explaining why '*found = 0' means the return value is only the string in-between delimiters. I don't really understand what is going on here or why it works. Thanks — John Meighan, Feb 11 '18 at 21:02
@JohnMeighan I've updated my answer addressing your last question — Pablo, Feb 11 '18 at 22:39
