String split in C with strtok function

Question

I'm trying to split some strings on the {white_space} symbol, but some of the splits come out wrong. I want to split on {white_space}, but keep quoted sub-strings together as a single token.

For example:

char *pch;
char str[] = "hello \"Stack Overflow\" good luck!";
pch = strtok(str," ");
while (pch != NULL)
{
    printf ("%s\n",pch);
    pch = strtok(NULL, " ");
}

This will give me

hello
"Stack
Overflow"
good
luck!

But what I want is:

hello
Stack Overflow
good
luck!

Any suggestions or ideas, please?

You're splitting on spaces, you're getting exactly what you're asking for. The quotation marks are just another character in the string. — Happington, Jun 06 '14 at 14:20
you'll have to do your own parsing: check if you are between an opening `"` and a closing `"`. — bolov, Jun 06 '14 at 14:29
if i write a C++ implementation for you is ok? I am too lazy to write this in C. — bolov, Jun 06 '14 at 14:31
@Happington oh! I've done edit question inside! There was some misunderstanding to explain my question : — user3524577, Jun 06 '14 at 14:36
It'd be interesting to see the C++ no matter what. strtok, is kinda broken, doesn't handle empty fields for example. — Rob11311, Jun 06 '14 at 14:37
@bolov sure, if logic or algorithm, could be on C too. thanks for comment. — user3524577, Jun 06 '14 at 14:37

score 2 · Answer 1 · answered Jun 06 '14 at 14:39

You'll need to tokenize twice. The program flow you currently have is as follows:

1) Search for space

2) Print all characters prior to space

3) Search for next space

4) Print all characters between last space, and this one.

You'll need to start thinking in a different manner: two layers of tokenization.

1) Search for quotation marks, splitting the input into strings
2) On odd-numbered strings, perform your original program (search for spaces)
3) On even-numbered strings, print blindly

In this case, even-numbered strings are (ideally) within quotes. For example, ab"cd"ef would result in ab being odd, cd being even, ef being odd, and so on.

The other side is remembering what you are actually looking for. As a regex, a token is either "[a-zA-Z0-9 \t\n]*" (a quoted string, which may contain white space) or [a-zA-Z0-9]+ (a bare word). The only difference between the two options is whether the text is enclosed in quotes. So separate by quotes first, and identify from there.
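The two layers described above could be sketched roughly as follows. This is only an illustration under a few assumptions: split_quoted, its out array and its max parameter are names made up for this example, the outer layer uses strchr rather than strtok so that empty segments (such as the one before a leading quote) are not silently skipped, and an unterminated quote is treated as running to the end of the string.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split `s` in place and store up to `max` token
 * pointers in `out`. Returns the number of tokens stored.
 * Outer layer: segments separated by '"' alternate outside/inside quotes.
 * Inner layer: segments outside quotes are split on spaces with strtok.
 */
size_t split_quoted(char *s, char **out, size_t max)
{
    size_t n = 0;
    int quoted = 0;                     /* the first segment is outside quotes */

    while (s != NULL) {
        char *end = strchr(s, '"');     /* next quote ends the current segment */
        if (end != NULL)
            *end = '\0';

        if (quoted) {
            if (n < max)
                out[n++] = s;           /* inside quotes: keep the segment whole */
        } else {
            /* outside quotes: the original program, split on spaces */
            for (char *w = strtok(s, " "); w != NULL; w = strtok(NULL, " "))
                if (n < max)
                    out[n++] = w;
        }

        s = (end != NULL) ? end + 1 : NULL;
        quoted = !quoted;               /* segments alternate */
    }
    return n;
}
```

Called on the string from the question with a large enough out array, this should yield the four tokens hello, Stack Overflow, good and luck!. Because the input is modified in place, it has to be a writable array, not a string literal.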

Rob11311 · Accepted Answer · 2014-06-06T15:00:57.483

Try altering your strategy.

Look at the non-white-space runs, and when you find a quoted string, put the whole thing in one string value.

So, you need a function that examines the characters between white space. When you find '"' you can change the rules and hoover up everything to the matching '"'. If this function returns a TOKEN type and a value (the string matched), then whatever calls it can decide on the correct output. Then you have written a tokeniser, also called a "lexer". Tools exist to generate them (lex and flex, for example), as they are widely used to implement programming languages and config file parsers.

Assuming firstc(str) starts the scan and nextc() returns the next char from the string ('\0' at the end):

for (firstc(str); (c = nextc()) != '\0';) {
    if (isspace(c))
        continue;               /* Skip white space between tokens */
    else if (c == '"')
        return readQuote();     /* Handle quoted string, up to the matching '"' */
    else
        return readWord();      /* Terminated by space or '"' */
}
return EOS;

You'll need to define return values for EOS, QUOTE and WORD, and a way to get the text in each Quote or Word.
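One way to fill in those pieces is sketched below. It is a minimal illustration, not a definitive implementation: the names TOK_EOS, TOK_WORD, TOK_QUOTE, next_token and print_tokens are invented here, the token text is reported as a start pointer plus a length so the input is never modified, and an unterminated quote is assumed to run to the end of the string.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Token kinds; illustrative names, not from any library. */
enum token { TOK_EOS, TOK_WORD, TOK_QUOTE };

/*
 * Scan the next token starting at *cursor. For TOK_WORD and TOK_QUOTE,
 * *start and *len describe the token text (quotes excluded) and *cursor
 * is advanced past the token.
 */
enum token next_token(const char **cursor, const char **start, size_t *len)
{
    const char *p = *cursor;

    while (*p != '\0' && isspace((unsigned char)*p))
        p++;                                /* skip white space between tokens */

    if (*p == '\0') {
        *cursor = p;
        return TOK_EOS;
    }

    if (*p == '"') {
        *start = ++p;                       /* text begins after the opening quote */
        while (*p != '\0' && *p != '"')
            p++;                            /* take everything up to the closing quote */
        *len = (size_t)(p - *start);
        *cursor = (*p == '"') ? p + 1 : p;  /* step over the closing quote if present */
        return TOK_QUOTE;
    }

    *start = p;
    while (*p != '\0' && !isspace((unsigned char)*p) && *p != '"')
        p++;                                /* a word ends at white space or a quote */
    *len = (size_t)(p - *start);
    *cursor = p;
    return TOK_WORD;
}

/* Print one token per line. */
void print_tokens(const char *str)
{
    const char *start;
    size_t len;

    while (next_token(&str, &start, &len) != TOK_EOS)
        printf("%.*s\n", (int)len, start);
}
```

Calling print_tokens("hello \"Stack Overflow\" good luck!") prints hello, Stack Overflow, good and luck! on separate lines. Since the caller gets the token kind back, it can also treat quoted strings and bare words differently if needed.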

score 0 · Answer 3 · edited May 23 '17 at 12:15

Here's the code that works... in C

The idea is that you first tokenize on the quote character, since that takes priority (if a string is inside quotes then we don't tokenize it, we just print it). Then each of the resulting strings is tokenized on the space character, but only every other one, because the strings alternate between being outside and inside the quotes.

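#define _POSIX_C_SOURCE 200809L  /* strtok_r is POSIX, not ISO C */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

int main(void) {
  char *pch1, *pch2, *save_ptr1, *save_ptr2;
  char str[] = "hello \"Stack Overflow\" good luck!";
  /* strtok_r skips leading delimiters, so a string that begins with a
     quote starts inside a quoted segment. */
  bool in = (str[0] == '"');
  /* Outer pass: split on quotes; segments alternate outside/inside.
     Note: an empty quoted string ("") is collapsed by strtok_r and
     throws the alternation off. */
  pch1 = strtok_r(str, "\"", &save_ptr1);
  while (pch1 != NULL) {
    if (in) {
      /* Inside quotes: print the segment whole. */
      printf("%s\n", pch1);
    } else {
      /* Outside quotes: inner pass, split on spaces. */
      pch2 = strtok_r(pch1, " ", &save_ptr2);
      while (pch2 != NULL) {
        printf("%s\n", pch2);
        pch2 = strtok_r(NULL, " ", &save_ptr2);
      }
    }
    pch1 = strtok_r(NULL, "\"", &save_ptr1);
    in = !in;
  }
  return 0;
}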
