
I'm trying to split some strings on white space, but there is a problem with some of the splits: I want to split on white space while keeping quoted sub-strings together as single tokens.

For example:

char *pch;
char str[] = "hello \"Stack Overflow\" good luck!";
pch = strtok(str," ");
while (pch != NULL)
{
    printf ("%s\n",pch);
    pch = strtok(NULL, " ");
}

This will give me

hello
"Stack
Overflow"
good
luck!

But what I want is:

hello
Stack Overflow
good
luck!

Any suggestions or ideas, please?

user3524577

3 Answers


You'll need to tokenize twice. The program flow you currently have is as follows:

1) Search for space

2) Print all characters prior to space

3) Search for next space

4) Print all characters between last space, and this one.

You'll need to start thinking in a different manner: two layers of tokenization.

  1. Search for Quotation Mark
  2. On odd-numbered strings, perform your original program (search for spaces)
  3. On even-numbered strings, print blindly

In this case, even-numbered strings are (ideally) within quotes. For example, ab"cd"ef would result in ab being odd, cd being even, and so on.

The other thing to keep in mind is what you are actually looking for. As a regex, it is either "[a-zA-Z0-9 \t\n]*" or [a-zA-Z0-9]+. The only difference between the two options is whether the token is enclosed in quotes. So separate by quotes first, and identify tokens from there.
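
The two-layer idea above could be sketched in C roughly as follows. This is only an illustration under some assumptions: the function name split_quoted and its tokens/max_tokens parameters are made up for this example, quotes are assumed to be balanced with no escaping, and the input buffer is modified in place. The outer layer uses strchr rather than strtok so that empty segments (for instance when the string starts with a quote) are still counted and the odd/even numbering stays correct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split `s` in place, first on '"', then (outside
 * quotes only) on spaces. Pointers to the tokens are stored in `tokens`,
 * at most `max_tokens` of them; the number stored is returned.
 * Segments are numbered from 1: odd segments are outside quotes and
 * even segments are inside, as in the ab"cd"ef example. */
size_t split_quoted(char *s, char *tokens[], size_t max_tokens)
{
    assert(s != NULL && tokens != NULL);
    size_t count = 0;
    int segment = 1;

    while (s != NULL && count < max_tokens) {
        /* Outer layer: find where the current segment ends. */
        char *quote = strchr(s, '"');
        if (quote != NULL)
            *quote = '\0';

        if (segment % 2 == 0) {
            /* Inside quotes: keep the segment as one token. */
            tokens[count++] = s;
        } else {
            /* Outside quotes: inner layer, split on spaces. */
            for (char *w = strtok(s, " ");
                 w != NULL && count < max_tokens;
                 w = strtok(NULL, " "))
                tokens[count++] = w;
        }

        /* Continue after the quote, or stop if there was none. */
        s = (quote != NULL) ? quote + 1 : NULL;
        segment++;
    }
    return count;
}
```

Called on a writable copy of the question's string with an array of, say, eight char pointers, this should store the four tokens hello, Stack Overflow, good and luck!, which the caller can then print one per line.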

Happington

Try altering your strategy.

Look at runs of non-white-space characters, and when you find a quoted string, put it into a single string value.

So you need a function that examines the characters between white space. When it finds '"', it can change the rules and hoover up everything until the matching '"'. If this function returns a TOKEN type and a value (the string matched), then the caller can decide how to produce the correct output. At that point you have written a tokeniser. Tokenisers are also known as "lexers", and tools exist to generate them, since they are widely used to implement programming languages and config files.

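Assuming nextc() returns the next character of the string, with scanning started by firstc(str), and assuming readQuote() and readWord() consume the rest of the token and return its type:

for (firstc(str); (c = nextc()) != '\0'; ) {
    if (isspace((unsigned char)c))
        continue;               /* Skip white space between tokens */
    else if (c == '"')
        return readQuote();     /* Handle quoted string */
    else
        return readWord();      /* Terminated by space or '"' */
}
return EOS;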

You'll need to define return values for EOS, QUOTE and WORD, and a way to get the text in each Quote or Word.
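
As one possible way to fill in those pieces, here is a minimal single-pass sketch in C. The names token_kind, TOKEN_EOS, TOKEN_WORD, TOKEN_QUOTE and next_token are hypothetical, chosen for this example only. It assumes quotes are not escaped, treats an unterminated quote as running to the end of the string, and reports each token as a start pointer plus a length instead of copying the text.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds, standing in for EOS, WORD and QUOTE. */
enum token_kind { TOKEN_EOS, TOKEN_WORD, TOKEN_QUOTE };

/* Scan one token starting at *cursor.
 * On return, *start and *len describe the token text (without the
 * surrounding quotes) and *cursor points just past the token, ready
 * for the next call. The input string is not modified. */
enum token_kind next_token(const char **cursor, const char **start, size_t *len)
{
    assert(cursor != NULL && *cursor != NULL && start != NULL && len != NULL);
    const char *p = *cursor;
    enum token_kind kind;

    /* Skip white space between tokens. */
    while (isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        kind = TOKEN_EOS;
        *start = p;
        *len = 0;
    } else if (*p == '"') {
        /* Quoted string: take everything up to the matching quote. */
        kind = TOKEN_QUOTE;
        *start = ++p;
        while (*p != '\0' && *p != '"')
            p++;
        *len = (size_t)(p - *start);
        if (*p == '"')
            p++;                /* step over the closing quote */
    } else {
        /* Plain word: terminated by white space or a quote. */
        kind = TOKEN_WORD;
        *start = p;
        while (*p != '\0' && *p != '"' && !isspace((unsigned char)*p))
            p++;
        *len = (size_t)(p - *start);
    }

    *cursor = p;
    return kind;
}
```

A caller would keep calling next_token until it returns TOKEN_EOS, printing each token with something like printf("%.*s\n", (int)len, start). Returning the kind separately lets the caller treat quoted strings and plain words differently if it needs to.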

Rob11311

Here's code that works, in C.

The idea is to tokenize on the quote character first, since that takes priority: a string inside quotes is not tokenized further, just printed. Each of the resulting segments is then tokenized on the space character, but only every other segment, because the segments alternate between being outside and inside the quotes.

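#include <stdio.h>
#include <string.h>   /* strtok_r is POSIX, not ISO C */
#include <stdbool.h>

int main(void) {
  char *pch1, *pch2, *save_ptr1, *save_ptr2;
  char str[] = "hello \"Stack Overflow\" good luck!";
  bool in = false;   /* true while the current segment is inside quotes */

  /* Outer pass: split on '"'. Note that strtok_r skips empty segments,
     so this assumes the string does not start with a quote. */
  pch1 = strtok_r(str, "\"", &save_ptr1);
  while (pch1 != NULL) {
    if (in) {
      /* Quoted segment: print it as a single token. */
      printf("%s\n", pch1);
    } else {
      /* Unquoted segment: inner pass, split on spaces. */
      pch2 = strtok_r(pch1, " ", &save_ptr2);
      while (pch2 != NULL) {
        printf("%s\n", pch2);
        pch2 = strtok_r(NULL, " ", &save_ptr2);
      }
    }
    pch1 = strtok_r(NULL, "\"", &save_ptr1);
    in = !in;   /* segments alternate between outside and inside quotes */
  }
  return 0;
}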

Spundun