9

I am trying to tokenize a string, but I need to know exactly when no data appears between two tokens. For example, when tokenizing the string "a,b,c,,,d,e", I need to know about the two empty slots between 'c' and 'd', which I am unable to find out using strtok() alone. My attempt is shown below:

char arr_fields[num_of_fields][MAX_FIELD_LEN]; /* MAX_FIELD_LEN: assumed per-field buffer size */
char delim[]=",\n";
char *tok;
tok=strtok(line,delim);//line contains the data

for(i=0;i<num_of_fields;i++,tok=strtok(NULL,delim))
{
    if(tok)
        sprintf(arr_fields[i], "%s", tok);
    else
        sprintf(arr_fields[i], "%s", "-");          
}

Executing the above code on the example string puts the characters a, b, c, d, e into the first five elements of arr_fields, which is not what I want. I need each field to go to a specific index of the array, i.e. if a field is missing between two delimiters, that position should be recorded as empty.

Jonathan Leffler
user1126425
  • 5
    @DhaivatPandya: That's not very useful advice unless it's accompanied by a reason... – Oliver Charlesworth Jan 02 '12 at 22:15
  • You mean "between 'c' and 'd'" ? – Eternal_Light Jan 02 '12 at 22:31
  • It is extremely accurate advice. The trouble is that `strtok()` is designed to ignore repeats of the token separator characters, and it obliterates them. Therefore, if you need to know about adjacent token separators, or if you need to know which separator marked the end of a token, you cannot use `strtok()` for the job. – Jonathan Leffler Jan 02 '12 at 22:39

6 Answers

18

7.21.5.8 the strtok function

The standard says the following regarding strtok:

[#3] The first call in the sequence searches the string pointed to by s1 for the first character that is not contained in the current separator string pointed to by s2. If no such character is found, then there are no tokens in the string pointed to by s1 and the strtok function returns a null pointer. If such a character is found, it is the start of the first token.

From the quote above we can see that strtok cannot solve your specific problem, since it treats any run of consecutive characters found in delims as a single delimiter.
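To make the collapsing behaviour concrete, here is a minimal sketch that just counts what strtok() reports. The helper name count_strtok_tokens is made up for illustration; only strtok() itself comes from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok() reports for line.
 * strtok() writes terminators into line, so pass a writable buffer. */
size_t count_strtok_tokens(char *line, const char *delims)
{
    assert(line != NULL && delims != NULL);

    size_t n = 0;
    for (char *tok = strtok(line, delims); tok != NULL;
         tok = strtok(NULL, delims))
        n++;                      /* empty fields never show up here */
    return n;
}
```

Called on a writable copy of "a,b,c,,,d,e" with the delimiters ",\n", this reports 5 tokens even though the line has 7 comma-separated fields, so the two empty fields are lost.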


Am I doomed to weep in silence, or can somebody help me out?

You can easily implement your own version of strtok that does what you want; see the snippets at the end of this post.

strtok_single makes use of strpbrk (char const* src, const char* delims), which returns a pointer to the first character in the null-terminated string src that matches any character in delims.

If no matching character is found the function will return NULL.


strtok_single

char *
strtok_single (char * str, char const * delims)
{
  static char  * src = NULL;
  char  *  p,  * ret = 0;

  if (str != NULL)
    src = str;

  if (src == NULL)
    return NULL;

  if ((p = strpbrk (src, delims)) != NULL) {
    *p  = 0;
    ret = src;
    src = ++p;

  } else if (*src) {
    ret = src;
    src = NULL;
  }

  return ret;
}

sample use

  char delims[] = ",";
  char data  [] = "foo,bar,,baz,biz";

  char * p    = strtok_single (data, delims);

  while (p) {
    printf ("%s\n", *p ? p : "<empty>");

    p = strtok_single (NULL, delims);
  }

output

foo
bar
<empty>
baz
biz
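Since strtok_single keeps its position in a static variable, it shares strtok's limitation that only one string can be tokenized at a time. A possible variant, sketched here under the assumption that the caller is willing to hold the state, moves that pointer into a caller-supplied argument. The name strtok_single_r and its signature are hypothetical, modelled loosely on strtok_r. Note one deliberate difference from the version above: this sketch also returns a final empty field, so "a,b," yields "a", "b" and "".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reentrant variant of strtok_single: the scan position
 * lives in *state (owned by the caller) instead of a static variable. */
char *strtok_single_r(char *str, const char *delims, char **state)
{
    assert(delims != NULL && state != NULL);

    if (str != NULL)
        *state = str;             /* first call: start scanning at str */
    if (*state == NULL)
        return NULL;              /* input already exhausted */

    char *token = *state;
    char *p = strpbrk(token, delims);
    if (p != NULL) {
        *p = '\0';                /* end the token at the delimiter */
        *state = p + 1;           /* resume just past that one delimiter */
    } else {
        *state = NULL;            /* last field: nothing left afterwards */
    }
    return token;
}
```

Usage mirrors the sample above: pass the buffer on the first call and NULL afterwards, with the address of a char * variable as the third argument each time.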
Filip Roséen - refp
  • @JonathanLeffler It is a simple example of an implementation following the rules set out by `strtok` but that conforms to what OP wishes for. Yes, `strtok` is not thread-safe, though that is a completely different matter than what OP is asking. – Filip Roséen - refp Jan 02 '12 at 22:48
  • @JonathanLeffler I never said anything to contradict what you are saying, I am well aware of the pitfalls that pop up (or down) when using `strtok`. – Filip Roséen - refp Jan 02 '12 at 22:58
  • @FilipRoséen-refp I have a question on one of your answers. Can you please see http://stackoverflow.com/questions/30294129/i-need-a-mix-of-strtok-and-strtok-single – aVC May 18 '15 at 02:10
  • 1
    Note that this version of `strtok_single()` doesn't return the segment after the last delimiter. There's a fixed version in this [answer](http://stackoverflow.com/a/30295426/15168), along with demonstration code of the problem. – Jonathan Leffler May 18 '15 at 05:27
  • @ChristopheQuintard well spotted, I think that particular fix got lost in some historic revision (it has now been fixed, [see history](http://stackoverflow.com/posts/8706031/revisions)). – Filip Roséen - refp Aug 27 '15 at 15:35
  • @JonathanLeffler I just noticed that you provided a fix for the bug addressed by [@ChristopheQuintard](http://stackoverflow.com/users/2017567/christophe-quintard), though functional I think you are interested in my fix that I just edited in. – Filip Roséen - refp Aug 27 '15 at 15:39
  • That looks approximately equivalent, though different in detail. The results look the same, which is good. (I can only upvote once — you already had it — because the "write your own if the standard tool doesn't do the job" advice is entirely valid, albeit to be used with caution. You need to make sure there isn't an alternative standard tool that you could use instead before reinventing the wheel.) – Jonathan Leffler Aug 27 '15 at 18:59
2

Lately I was looking for a solution to the same problem and found this thread.

You can use strsep(). From the manual:

The strsep() function was introduced as a replacement for strtok(3), since the latter cannot handle empty fields.
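A minimal sketch of how strsep() could be used for this, assuming a platform that provides it (it is a BSD/glibc extension, not ISO C). The helper name split_with_strsep and the _DEFAULT_SOURCE define are assumptions about the build setup, not part of the manual quote.

```c
#define _DEFAULT_SOURCE   /* assumption: asks glibc to declare strsep() */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits line in place and stores up to max_fields
 * pointers in fields[]. Empty fields come back as "" instead of being
 * skipped. Returns the number of fields stored. */
size_t split_with_strsep(char *line, const char *delims,
                         char *fields[], size_t max_fields)
{
    assert(line != NULL && delims != NULL && fields != NULL);

    size_t n = 0;
    char *rest = line;            /* strsep() advances this pointer */
    char *tok;

    while (n < max_fields && (tok = strsep(&rest, delims)) != NULL)
        fields[n++] = tok;
    return n;
}
```

For a writable copy of "a,b,c,,,d,e" and the delimiter ",", this stores 7 fields, with fields[3] and fields[4] being empty strings. Like strtok(), strsep() overwrites the delimiters in the input buffer.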

Miroslaw Opoka
2

You can't use strtok() if that's what you want. From the man page:

A sequence of two or more contiguous delimiter characters in the parsed string is considered to be a single delimiter. Delimiter characters at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always nonempty strings.

Therefore it is just going to jump from c to d in your example.

You're going to have to parse the string manually or perhaps search for a CSV parsing library that would make your life easier.

Brian Roach
1

As mentioned in this answer, you'll want to implement something like strtok yourself. I prefer using strcspn (as opposed to strpbrk), as it allows for fewer if statements:

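char arr_fields[num_of_fields][MAX_FIELD_LEN]; /* MAX_FIELD_LEN: assumed per-field buffer size */
char delim[] = ",\n";

size_t current_token = 0;
for (i = 0; i < num_of_fields; i++)
{
    /* Length of the field starting at current_token (0 for an empty field). */
    int token_length = (int)strcspn(line + current_token, delim);
    if (token_length)
        snprintf(arr_fields[i], sizeof arr_fields[i], "%.*s", token_length, line + current_token);
    else
        snprintf(arr_fields[i], sizeof arr_fields[i], "%s", "-");
    current_token += token_length;
    /* Step over exactly one delimiter so adjacent delimiters yield empty fields. */
    if (line[current_token] != '\0')
        current_token++;
}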
MSN
0
  1. Parse (for example, strtok)
  2. Sort
  3. Insert
  4. Rinse and repeat as needed :)
paulsm4
0

You could try using strchr to find the locations of the ',' characters. Manually copy the string up to the delimiter you found (using memcpy or strncpy), then call strchr again from the next character. This way you can tell when two or more commas are next to each other (the pointers returned by consecutive strchr calls will differ by exactly 1), and you can write an if statement to handle that case.
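As a rough sketch of that idea, the function below walks a line with strchr and counts the empty fields. The name count_empty_fields is hypothetical, and the sketch only looks for commas; handling several delimiter characters at once would need something like strpbrk or strcspn instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts empty fields in a comma-separated line
 * by checking whether each comma sits right where a field starts. */
size_t count_empty_fields(const char *line)
{
    assert(line != NULL);

    size_t empty = 0;
    const char *start = line;     /* start of the current field */
    const char *comma;

    while ((comma = strchr(start, ',')) != NULL) {
        if (comma == start)       /* nothing between two delimiters */
            empty++;
        start = comma + 1;        /* next field begins after the comma */
    }
    if (*start == '\0')           /* nothing after the last comma */
        empty++;
    return empty;
}
```

For "a,b,c,,,d,e" this returns 2, matching the two empty slots between 'c' and 'd'.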

Eternal_Light
  • Since the delimiters can be comma or newline, `strchr()` is not the appropriate tool to use. – Jonathan Leffler Jan 02 '12 at 23:03
  • can't strchr() locate the '\n' value? – Eternal_Light Jan 02 '12 at 23:08
  • Yes, `strchr()` can find commas, and it can find newlines, but to find the next 'comma or newline', you have to call `strchr()` twice, once to look for the comma, once for the newline. – Jonathan Leffler Jan 02 '12 at 23:09
  • Can't you use a 'case' statement? I think it is not the most appropriate tool to use but it can solve the problem alright. – Eternal_Light Jan 02 '12 at 23:10
  • 1
    There are other string functions - `strspn()`, `strcspn()`, `strpbrk()` in particular - that do most of the needed job. Yes, I'm sure you could write it using a `case` statement, but it isn't what springs to mind. – Jonathan Leffler Jan 02 '12 at 23:14
  • +1 because you are right :) I always tend to use the most simplistic tool and build by hand around it to try and make it work as I want it to. This, obviously, gets me in a lot of trouble... – Eternal_Light Jan 02 '12 at 23:18