How to tokenize sentences with symbols in C

Question

I'm trying to figure out how to tokenize Unix commands, but I don't know how to work around the fact that strtok() treats every character in the delimiter string as a separate delimiter. E.g. strtok(string, ". ") splits on the period AND on the space.

The string I'm trying to tokenize might be something like ps aux( sort ( more, and there may be spaces before and after each parenthesis.

Even if I do strtok(string, "("), there's still whitespace before or after the words, and apparently execvp() doesn't recognize those tokens. E.g.

ps aux 
 sort
 more

The output I'm expecting is

ps aux
sort
more

Are there any other functions that allow specific inputs like " ( " for it to be split for tokens?

Just write a simple function to remove leading whitespaces from each string. — kaylum, Feb 09 '20 at 05:42
The larger issue is what you are calling `"("` parenthesis, should actually be *pipes*, e.g. `'|'` tying the `stdout` from the prior command (e.g. `ps aux`) to the `stdin` for `sort`. This you have to handle using pipes in C as well and involves more than simply tokenizing the string to send to `execvp`. I'll see if I can find the duplicate for that question.. — David C. Rankin, Feb 09 '20 at 08:29
See [Pipe function in Linux shell write in C](https://stackoverflow.com/questions/36156341/pipe-function-in-linux-shell-write-in-c) and [Simple shell with pipe( ) function](https://stackoverflow.com/questions/26788603/simple-shell-with-pipe-function) — David C. Rankin, Feb 09 '20 at 08:34
It's possible that `strtok()` is not the correct function to be using. It certainly isn't the function I'd be using because it zaps the delimiter. Look up `strspn()` and `strcspn()` — they can be useful. — Jonathan Leffler, Feb 09 '20 at 15:55

chqrlie · Answer 1 · 2020-02-09T16:48:16.193

Do not use strtok for this; it is not the right tool for accurate parsing.

You can use strspn() and strcspn() to scan the string for separators without modifying the string.

Here is a simplistic example:

#include <stdio.h>
#include <string.h>

void parse_line(const char *buf) {
    int pos, len;

    for (pos = 0; buf[pos]; pos += len) {
        len = strspn(buf + pos, " \t\r\n");     // skip blanks
        if (len > 0) {
            continue;
        }
        len = strspn(buf + pos, "<>|&[]()");
        if (len > 0) {
            printf("operator %.*s\n", len, buf + pos);
            continue;
        }
        if (buf[pos] == '\'') {
            len = 1 + strcspn(buf + pos + 1, "'");
            if (buf[pos + len] != '\'') {
                printf("unterminated string: %.*s\n", len, buf + pos);
                break;
            }
            len += 1;
            printf("string: %.*s\n", len, buf + pos);
            continue;
        }
        if (buf[pos] == '\"') {
            len = 1 + strcspn(buf + pos + 1, "\"");
            if (buf[pos + len] != '\"') {
                printf("unterminated string: %.*s\n", len, buf + pos);
                break;
            }
            len += 1;
            printf("string: %.*s\n", len, buf + pos);
            continue;
        }
        len = strcspn(buf + pos, "\'\" \t\r\n<>|&[]()");
        printf("token: %.*s\n", len, buf + pos);
    }
}

int main(void) {
    char buf[128];

    while (fgets(buf, sizeof buf, stdin)) {
        parse_line(buf);
    }
    return 0;
}

tshiono · Answer 2 · 2020-02-09T23:05:40.320

Assuming:

You want to split a line on a left paren (.
The left paren may be preceded and/or followed by whitespace (judging from your input, there is no whitespace between aux and the following left paren).

Then how about an awk solution:

str="ps aux( sort ( more"
awk -F ' *\\( *' '{ for (i=1; i<=NF; i++) print $i}' <<< "$str"

Output:

ps aux
sort
more

The -F option determines the input field separator.
The pattern ' *\\( *' is a regex that matches a left paren with zero or more spaces before and/or after it.

If my assumption is incorrect, please let me know.

[EDIT]

If you prefer a C solution, the following code may help you get started:

#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    regex_t    preg;
    const char *string = "ps aux( sort ( more";
    const char *pattern = " *( *";     // regex of the delimiter; in a basic regex (cflags 0) a bare ( is a literal paren
    char       out[256];               // output buffer
    int        rc;
    size_t     nmatch = 1;
    regmatch_t pmatch[1];

    // compile the regex
    if (0 != (rc = regcomp(&preg, pattern, 0))) {
        printf("regcomp() failed, returning nonzero (%d)\n", rc);
        exit(EXIT_FAILURE);
    }

    // loop while the regex of delimiter is found
    while (0 == (rc = regexec(&preg, string, nmatch, pmatch, 0))) {
        strncpy(out, string, pmatch[0].rm_so);  // copy the substring to print
        out[pmatch[0].rm_so] = 0;       // terminate the string
        printf("%s\n", out);
        string += pmatch[0].rm_eo;      // seek the pointer to the start of the next token
    }
    // print the last remaining portion
    if (strlen(string) > 0) {
        printf("%s\n", string);
    }
    regfree(&preg);
    return 0;
}

[Explanation]
If regexec() succeeds, it returns the "start position of the matched substring" in pmatch[0].rm_so and the "next to end position of the matched substring" in pmatch[0].rm_eo as follows:

1st call of regexec()
string:  ps aux( sort ( more
               ^ ^
           rm_so rm_eo

We can interpret them as: pmatch[0].rm_so holds the length of the 1st token and pmatch[0].rm_eo indicates the start position of the next token. Then we update the variables and invoke the 2nd regexec():

2nd call of regexec()
string:  sort ( more
             ^  ^
         rm_so  rm_eo

We repeat the loop until regexec() returns a non-zero value, meaning there are no more matches. The last token is then whatever remains in string.

zmb · Answer 3 · Feb 09 '20 at 08:41

To my knowledge, (ANSI) C does not have any more powerful tools than that, but if you must use it, you could try a regex library, though you might have to do some of the work yourself (I don't know whether GNULib has a regex_replace_all function, for example).

You might want to have a look at this.

An inventory of regex libs and more about this topic can also be found here.

PS: This should rather be a comment, but I don't have the rights to write one

Actually it does: `strspn()` and `strcspn()` from `<string.h>` return the number of characters in the token and do not modify the string. Elaborate parsing such as handling escape sequences requires more code but these often overlooked functions are a good tool to get started. – chqrlie Feb 09 '20 at 16:32
