I think the best way to split on an ordered sequence of delimiters is to
replicate the behaviour of strtok_r using strstr, like this:
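#include <stdio.h>
#include <string.h>

/* Like strtok_r, but splits on the whole substring substrdelim
   instead of on any single character of it. */
char *substrtok_r(char *str, const char *substrdelim, char **saveptr)
{
    char *haystack;

    if(str)
        haystack = str;       /* start a new scan */
    else
        haystack = *saveptr;  /* resume where the last call stopped */

    char *found = strstr(haystack, substrdelim);

    if(found == NULL)
    {
        /* no delimiter left: return the rest, or NULL if nothing remains */
        *saveptr = haystack + strlen(haystack);
        return *haystack ? haystack : NULL;
    }

    *found = 0;                              /* terminate the token */
    *saveptr = found + strlen(substrdelim);  /* skip past the delimiter */
    return haystack;
}

int main(void)
{
    char line[] = "a -> b -> c -> d; Test - Number -> <10.0> ->No->split->here";
    char *input = line;
    char *token;
    char *save;

    /* extra parentheses: the assignment is intentional */
    while((token = substrtok_r(input, " ->", &save)))
    {
        input = NULL;  /* later calls resume from save */
        printf("token: '%s'\n", token);
    }

    return 0;
}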
This behaves like strtok_r, but it only splits where the whole substring is
found. The output of this is:
$ ./a
token: 'a'
token: ' b'
token: ' c'
token: ' d; Test - Number'
token: ' <10.0>'
token: 'No->split->here'
And like strtok and strtok_r, it requires the source string to be modifiable,
because it writes a '\0'-terminating byte into it to create the tokens it
returns.
EDIT
Hi, would you mind explaining why '*found = 0'
means the return value is only the string in-between delimiters. I don't really understand what is going on here or why it works. Thanks
The first thing you've got to understand is how strings work in C. A string is
just a sequence of bytes (characters) that ends with the '\0'-terminating
byte. I wrote bytes and characters in parentheses because a character in C is
just a 1-byte value (on most systems a byte is 8 bits long), and on most
systems the integer values representing the characters are those defined in
the ASCII code table, which are 7-bit values. As you can see from the table,
the value 97 represents the character 'a', 98 represents 'b', etc. Writing
char x = 'a';
is the same as doing
char x = 97;
The value 0 is a special value for strings; it is called NUL (null character)
or the '\0'-terminating byte. This value is used to tell functions where a
string ends. A function like strlen, which returns the length of a string,
does so by counting how many bytes it encounters before it reaches a byte with
the value 0.
That's why strings are stored in char arrays: a pointer to the array gives you
the start of the memory block where the sequence of chars is stored.
Let's look at this:
char string[] = { 'H', 'e', 'l', 'l', 'o', 0, 48, 49, 50, 0 };
The memory layout for this array would be
   0     1     2     3     4    5     6     7     8    9
+-----+-----+-----+-----+-----+----+-----+-----+-----+----+
| 'H' | 'e' | 'l' | 'l' | 'o' | \0 | '0' | '1' | '2' | \0 |
+-----+-----+-----+-----+-----+----+-----+-----+-----+----+
or, to be more precise, with the integer values
  0     1     2     3     4    5   6    7    8    9
+----+-----+-----+-----+-----+---+----+----+----+---+
| 72 | 101 | 108 | 108 | 111 | 0 | 48 | 49 | 50 | 0 |
+----+-----+-----+-----+-----+---+----+----+----+---+
Note that the value 0 represents '\0', 48 represents '0', 49 represents '1'
and 50 represents '2'. If you do
printf("%zu\n", strlen(string));
the output will be 5. strlen finds the value 0 at position 5 and stops
counting. However, string actually stores two strings, because from position 6
on a new sequence of characters starts that also terminates with 0, which
makes it a second valid string in the array. To access it, you need a pointer
that points past the first 0 value.
printf("1. %s\n", string);
printf("2. %s\n", string + strlen(string) + 1);
The output would be
1. Hello
2. 012
This property is used by functions like strtok (and mine above) to return a
substring of a larger string without creating a copy (which would mean
creating a new array or dynamically allocating memory, and using strcpy to
create the copy).
Assume you have this string:
char line[] = "This is a sentence;This is another one";
Here you have one string only, because the '\0'-terminating byte comes after
the last 'e' in the string. If, however, I do:
line[18] = 0; // same as line[18] = '\0';
then I created two strings in the same array:
"This is a sentence\0This is another one"
because I replaced the semicolon ';' with '\0', thus creating one string from
position 0 to 18 and a second one from position 19 to 38. If I now do
printf("string: %s\n", line);
the output will be
string: This is a sentence
Now let's take a look at the function itself:
char *substrtok_r(char *str, const char *substrdelim, char **saveptr);
The first argument is the source string, the second argument is the delimiter
string and the third one is a double pointer to char, that is, you have to
pass a pointer to a pointer to char. It is used to remember where the function
should resume scanning next; more on that later.
This is the algorithm:
if str is not NULL:
    start a new scan sequence from str
otherwise:
    resume scanning from the string pointed to by *saveptr

find the position of the delimiting substring (substring_d)
pointed to by 'substrdelim'

if no such substring_d is found:
    if the current character of the scanned text is \0:
        no more substrings to return --> return NULL
    otherwise:
        return the scanned text and set *saveptr to
        point to the \0 character of the scanned text,
        so that the next call ends the scanning
        by returning NULL
otherwise (a substring_d was found):
    create a new substring_a that ends just before the found
    substring_d by setting the first character of the found
    substring_d to 0
    update *saveptr to the start of the found substring_d
    plus its length, so that *saveptr points just past
    the delimiter sequence
    return the newly created substring_a
This first part is easy to understand:
if(str)
    haystack = str;
else
    haystack = *saveptr;
Here, if str is not NULL, you want to start a new scan sequence. That's why in
main the input pointer is set to point to the start of the string stored in
line. Every subsequent call must be made with str == NULL; that's why the
first thing done in the while loop is to set input = NULL; so that substrtok_r
resumes scanning from *saveptr. This is the standard behaviour of strtok.
The next step is to look for a delimiting substring:
char *found = strstr(haystack, substrdelim);
The next part handles the case where no delimiting substring is found [2]:
if(found == NULL)
{
    *saveptr = haystack + strlen(haystack);
    return *haystack ? haystack : NULL;
}
*saveptr is updated to point past the whole source, so that it points to the
'\0'-terminating byte. The return line can be rewritten as
if(*haystack == '\0')
    return NULL;
else
    return haystack;
which says that if the source is already an empty string [1], then return
NULL. This means no more substrings can be found, so stop calling the
function. This is also the standard behaviour of strtok.
The last part
*found = 0;
*saveptr = found + strlen(substrdelim);
return haystack;
handles the case where a delimiting substring is found. Here
*found = 0;
is basically doing
found[0] = '\0';
which creates substrings as explained above. To make it clear once again,
before
*found = 0;
*saveptr = found + strlen(substrdelim);
return haystack;
the memory looks like this:
+-----+-----+-----+-----+-----+-----+
| 'a' | ' ' | '-' | '>' | ' ' | 'b' | ...
+-----+-----+-----+-----+-----+-----+
   ^     ^
   |     |
haystack found
*saveptr
After
*found = 0;
*saveptr = found + strlen(substrdelim);
the memory looks like this:
+-----+------+-----+-----+-----+-----+
| 'a' | '\0' | '-' | '>' | ' ' | 'b' | ...
+-----+------+-----+-----+-----+-----+
   ^     ^                  ^
   |     |                  |
haystack found           *saveptr
because strlen(substrdelim) is 3.
Remember, if I do printf("%s\n", haystack); at this point, it will print a,
because the ' ' that found points to has been set to 0. *found = 0 created two
strings out of one, as explained above. strtok (and my function, which is
based on strtok) uses the same technique. So when the function does
return haystack;
the string that ends up in token is the text before the split. Eventually
substrtok_r returns NULL and the loop exits, because substrtok_r returns NULL
when no more splits can be created, just like strtok.
Footnotes
[1] An empty string is a string whose first character is already the
'\0'-terminating byte.
[2] This is a very important part. Most of the standard functions in the C
library, like strstr, will not return a new string in memory; they will not
create a copy and return it (unless the documentation says so). They return a
pointer into the original, that is, the original address plus an offset.
On success strstr returns a pointer to the start of the substring; this
pointer is at an offset from the source.
const char *txt = "abcdef";
const char *p = strstr(txt, "cd");
Here strstr returns a pointer to the start of the substring "cd" in "abcdef".
To get the offset you do p - txt, which gives how many bytes apart they are:
b = base address where txt is pointing to
   b    b+1   b+2   b+3   b+4   b+5   b+6
+-----+-----+-----+-----+-----+-----+------+
| 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | '\0' |
+-----+-----+-----+-----+-----+-----+------+
   ^           ^
   |           |
  txt          p
So txt points to address b and p points to address b+2. That's why you get the
offset by doing p - txt, which is (b+2) - b => 2. So p points to the original
address plus an offset of 2 bytes. Because of this behaviour, things like
*found = 0; work in the first place.
Note that an expression like txt + 2 gives you a new pointer that points to
where txt points plus an offset of 2. This is called pointer arithmetic. It's
like regular arithmetic, but here the compiler takes the size of the
pointed-to object into consideration. char is a type that is defined to have a
size of 1, hence sizeof(char) returns 1. But let's say you have an array of
integers:
int arr[] = { 7, 2, 1, 5 };
On my system an int has a size of 4, so an int object needs 4 bytes in memory.
This array looks like this in memory:
base = base address where arr is stored
address   base        base + 4    base + 8    base + 12
in bytes +-----------+-----------+-----------+-----------+
         |     7     |     2     |     1     |     5     |
         +-----------+-----------+-----------+-----------+
pointer   arr         arr + 1     arr + 2     arr + 3
arithmetic
Here arr + 1 gives you a pointer that points to where arr is stored plus an
offset of 4 bytes.