
I'm writing a C program that parses user input into a char, and two strings of set length. The user input is stored into a buffer using fgets, and then parsed with sscanf. The trouble is, the three fields have a maximum length. If a string exceeds this length, the remaining characters before the next whitespace should be consumed/discarded.

#include <stdio.h>
#define IN_BUF_SIZE 256

int main(void) {
    char inputStr[IN_BUF_SIZE];
    char command;
    char firstname[6];
    char surname[6];

    fgets(inputStr, IN_BUF_SIZE, stdin);
    sscanf(inputStr, "%c %5s %5s", &command, firstname, surname);
    printf("%c %s %s\n", command, firstname, surname);
}

So, with an input of
a bbbbbbbb cc
the desired output would be
a bbbbb cc
but instead the output is
a bbbbb bbb

Using a format specifier "%c%*s %5s%*s %5s%*s" runs into the opposite problem, where each substring needs to exceed the set length to get to the desired outcome.

Is there a way to achieve this using format specifiers, or is the only way to save the substrings in buffers of their own before cutting them down to the desired length?

Jonathan Leffler
  • I don't think `scanf()`'s field descriptor language and overall semantics are jointly powerful enough to express your requirements in a single format string passed to a single function call. You could, however, achieve what you want via a *series* of `scanf()` calls. – John Bollinger Mar 29 '17 at 22:44
  • See [How to use `sscanf()` in loops?](http://stackoverflow.com/questions/3975236) for an exposition on what @JohnBollinger mentions. You might want to think about making the command into a `%1s` rather than a `%c` for consistency (remembering to allocate space for the null byte too). See also [How to prevent `scanf()` from causing buffer overflows?](https://stackoverflow.com/questions/1621394/), though you're obviously aware of a major part of the answer (you use `%ns` where `n` is a number). – Jonathan Leffler Mar 29 '17 at 22:48
  • If you're doing things without `sscanf()`, consider `strcspn()` — and/or `strpbrk()` — and maybe `strspn()` too; these would allow you to find the white space and decide how long the fields are and what you're going to do about it. – Jonathan Leffler Mar 29 '17 at 22:55
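
The `strspn()`/`strcspn()` approach mentioned in the last comment might look like the sketch below. `take_word()` is a hypothetical helper name, and the whitespace set it uses is an assumption; neither comes from the thread. It measures the whole word first, copies only the part that fits, and returns a pointer past the entire word so the excess is discarded.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): copy at most `lim` characters
 * of the next whitespace-delimited word in `s` into `dst`, which must hold
 * lim + 1 bytes. Returns a pointer just past the whole word, so characters
 * beyond the limit are skipped rather than left for the next field. */
const char *take_word(const char *s, char *dst, size_t lim)
{
    static const char ws[] = " \t\r\n";   /* assumed separator set */

    s += strspn(s, ws);                   /* skip leading whitespace */
    size_t len = strcspn(s, ws);          /* full length of the word */
    size_t keep = len < lim ? len : lim;  /* truncate to the limit */

    memcpy(dst, s, keep);
    dst[keep] = '\0';
    return s + len;                       /* at whitespace or '\0' */
}
```

Calling it three times on the input line, with limits 1, 5 and 5, should yield the command, first name and surname in turn.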

2 Answers


In addition to the other answers, never forget that when facing string-parsing problems, you always have the option of simply walking a pointer down the string to accomplish any type of parsing you require. When you read your string into a buffer (buf in the full example below), you have an array of characters you are free to analyze manually (either with array indexes, e.g. buffer[i], or by assigning a pointer to the beginning, e.g. char *p = buffer;). With your string, you have the following in buffer, with p pointing to the first character in buffer:

--------------------------------
|a| |b|b|b|b|b|b|b|b| |c|c|\n|0|    contents
--------------------------------
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4      index
 |
 p

To test the character pointed to by p, you simply dereference the pointer, e.g. *p. So to test whether you have an initial character between a-z followed by a space at the beginning of buffer, you simply need do:

    /* validate first char is 'a-z' and followed by ' ' */
    if (*p && 'a' <= *p && *p <= 'z' && *(p + 1) == ' ') {
        cmd = *p;
        p += 2;     /* advance pointer to next char following ' ' */
    }

note: you are testing *p first (which is shorthand for *p != 0 or the equivalent *p != '\0') to validate that the string is not empty (i.e. the first char isn't the nul-byte) before proceeding with further tests. You would also include an else { /* handle error */ } in the event any one of the tests failed (meaning you have no command followed by a space).

When you are done, you are left with p pointing to the third character in buffer, e.g.:

--------------------------------
|a| |b|b|b|b|b|b|b|b| |c|c|\n|0|    contents
--------------------------------
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4      index
     |
     p

Now your job is simple: advance by no more than 5 characters (or until the next space is encountered), assigning the characters to firstname, and then nul-terminate following the last character:

    /* read up to NMLIM chars into fname */
    for (n = 0; n < NMLIM && *p && *p != ' ' && *p != '\n'; p++)
        fname[n++] = *p;
    fname[n] = 0;           /* nul terminate */

note: since fgets reads and includes the trailing '\n' in buffer, you should also test for the newline.
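
The bounded copy loop can be packaged as a function. This is only a sketch: `copy_limited()` and its parameter names are hypothetical and do not appear in the answer's program. It stores at most `lim` characters, always nul-terminates, and reports where it stopped through `*pp`.

```c
#include <stddef.h>

/* Hypothetical helper: copy characters from *pp into dst until `lim`
 * characters have been stored or a space, newline, or nul-byte is reached.
 * dst must have room for lim + 1 bytes. On return, *pp points at the first
 * character that was NOT copied. Returns the number of characters stored. */
size_t copy_limited(const char **pp, char *dst, size_t lim)
{
    const char *p = *pp;
    size_t n = 0;

    while (n < lim && *p && *p != ' ' && *p != '\n')
        dst[n++] = *p++;
    dst[n] = '\0';                  /* nul-terminate */

    *pp = p;
    return n;
}
```

With the example input and `lim` set to 5, the pointer ends up on the sixth `b`, which is why the leftover characters still have to be discarded before reading the surname.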

When you exit the loop, p is pointing to the eighth character in the buffer (index 7), just past the five characters that were copied:

--------------------------------
|a| |b|b|b|b|b|b|b|b| |c|c|\n|0|    contents
--------------------------------
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4      index
               |
               p

You now simply read forward until you encounter the next space and then, if one was found, advance past it, e.g.:

    /* discard remaining chars up to next ' ' */
    while (*p && *p != ' ') p++;

    if (*p) p++;    /* step past the space, never past the nul-byte */

note: if you exited the firstname loop already pointing at a space, the while loop body does not execute and only the final advance takes place.
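
The discard step can also be wrapped in a small function. The sketch below is a hypothetical helper, not part of the program in this answer; it steps over the separator only when one is actually present, so the pointer cannot move past the terminating nul-byte when the line has no further fields.

```c
/* Hypothetical helper: skip whatever is left of the current word, then
 * step over one separating space if there is one. Stops at the nul-byte
 * and never advances beyond it. */
const char *skip_rest(const char *p)
{
    while (*p && *p != ' ')     /* discard leftover word characters */
        p++;
    if (*p == ' ')              /* only advance if a separator exists */
        p++;
    return p;
}
```

If the returned pointer refers to the nul-byte, there is no next field to read.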

Finally, all you do is repeat the same loop for surname that you did for firstname. Putting all the pieces of the puzzle together, you could do something similar to the following:

#include <stdio.h>

enum { NMLIM = 5, BUFSIZE = 256 };

int main (void) {

    char buf[BUFSIZE] = "";

    while (fgets (buf, BUFSIZE, stdin)) {
        char *p = buf, cmd,                 /* walking pointer & command char */
            fname[NMLIM+1] = "",
            sname[NMLIM+1] = "";
        size_t n = 0;

        /* validate first char is 'a-z' and followed by ' ' */
        if (*p && 'a' <= *p && *p <= 'z' && *(p + 1) == ' ') {
            cmd = *p;
            p += 2;     /* advance pointer to next char following ' ' */
        }
        else {  /* handle error */
            fprintf (stderr, "error: no single command followed by space.\n");
            return 1;
        }

        /* read up to NMLIM chars into fname */
        for (n = 0; n < NMLIM && *p && *p != ' ' && *p != '\n'; p++)
            fname[n++] = *p;
        fname[n] = 0;           /* nul terminate */

        /* discard remaining chars up to next ' ' */
        while (*p && *p != ' ') p++;

        if (*p) p++;    /* step past the space, never past the nul-byte */

        /* read up to NMLIM chars into sname */
        for (n = 0; n < NMLIM && *p && *p != ' ' && *p != '\n'; p++)
            sname[n++] = *p;
        sname[n] = 0;           /* nul terminate */

        printf ("input  : %soutput : %c %s %s\n",
                buf, cmd, fname, sname);
    }

    return 0;
}

Example Use/Output

$ echo  "a bbbbbbbb cc" | ./bin/walkptr
input  : a bbbbbbbb cc
output : a bbbbb cc

Look things over and let me know if you have any questions. No matter how elaborate the string or what you need from it, you can always get what you need by simply walking a pointer (or a pair of pointers) down the length of the string.

David C. Rankin
  • Introducing new users to *walking a pointer* is worth its weight in gold if they get it. You are never faced with anything you can't parse from that point forward. The issue then becomes choosing the right tool for the job, knowing that for longer strings, the efficiencies provided by the standard library functions can really help (with the loop unrolling, etc. they incorporate for that purpose). A good example is looking at the `strlen` code. – David C. Rankin Mar 30 '17 at 05:46
  • Piping the echo output to stdin works, but with plain interactive stdin there is an issue: you may want to break out of the while fgets loop. – jian Aug 11 '22 at 12:03
  • @jian I like it, now you are schooling me! What is the issue you have spotted? `buf` is just split into `fname` and `sname` for each line of input read. That's why there is the `while(fgets(...)) {...}`. just in case you have multiple lines of input. Here `buf` is being read from `stdin` but you can redirect a file on `stdin` with `./bin/walkptr < myfile` if you want to provide input from a file, or you can use a *herestring* on bash `./bin/walkptr <<< "a bbbbbbbb cc"`, either way, it doesn't matter where the input comes from. – David C. Rankin Aug 11 '22 at 17:23

One way to split the input buffer as OP desires is to use multiple calls to sscanf(), and to use the %n conversion specifier to keep track of the number of characters read. In the code below, the input string is scanned in three stages.

First, the pointer strPos is assigned to point to the first character of inputStr. Then the input string is scanned with " %c%n%*[^ ]%n". This format string skips over any initial whitespace that a user might enter before the first character, and stores the first character in command. The %n directive tells sscanf() to store the number of characters read so far in the variable n; then the %*[^ ] directive tells sscanf() to read and discard any characters until a space is encountered. This effectively skips over any remaining characters that were entered after the initial command character. The %n directive appears again, and overwrites the previous value with the number of characters read up to this point. The reason for using %n twice is that, if the user enters a character followed by a space (as expected), the %*[^ ] directive finds no matches, and sscanf() returns without ever reaching the final %n directive.

The pointer strPos is moved to the beginning of the remaining string by adding n to it, and sscanf() is called a second time, this time with "%5s%n%*[^ ]%n". Here, up to 5 characters are read into the character array firstname[], the number of characters read is saved by the %n directive, any remaining non-whitespace characters are read and ignored, and finally, if the scan made it this far, the number of characters read is saved again.

strPos is increased by n again, and the final scan only needs "%5s" to complete the task.

Note that the return value of fgets() is checked to be sure that it was successful. The call to fgets() was changed slightly to:

fgets(inputStr, sizeof inputStr, stdin)

The sizeof operator is used here instead of IN_BUF_SIZE. This way, if the declaration of inputStr is changed later, this line of code will still be correct. Note that the sizeof operator works here because inputStr is an array, and arrays do not decay to pointers in sizeof expressions. But, if inputStr were passed into a function, sizeof could not be used in this way inside the function, because arrays decay to pointers in most expressions, including function calls. Some, such as @DavidC.Rankin, prefer a constant, as OP has used. If this seems confusing, I would suggest sticking with the constant IN_BUF_SIZE.
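
The difference between the two situations can be seen in a short sketch. The function names here are made up for illustration and are not part of the program in this answer.

```c
#include <stddef.h>

#define IN_BUF_SIZE 256

/* In the function that declares the array, sizeof sees the array type
 * and yields its full size in bytes. */
size_t size_where_declared(void)
{
    char inputStr[IN_BUF_SIZE];
    return sizeof inputStr;         /* IN_BUF_SIZE */
}

/* Once the array is passed to a function, only a pointer to its first
 * element arrives, so sizeof reports the size of the pointer instead. */
size_t size_in_callee(char *str)
{
    return sizeof str;              /* same as sizeof(char *) */
}
```

A function that needs the capacity of a buffer it receives therefore has to be given it as a separate argument, or use a shared constant such as IN_BUF_SIZE.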

Also note that the return value of each call to sscanf() is checked to be certain that the input matches expectations. For example, if the user enters a command and a first name, but no surname, the program will print an error message and exit. It is worth pointing out that, if the user enters, say, a command character and first name only, after the second sscanf() the match may have failed on \n, and strPos is then incremented to point to the \0 and so is still in bounds. But this relies on the newline being in the string. With no newline, the match might fail on \0, and then strPos would be incremented out of bounds before the next call to sscanf(). Fortunately, fgets() retains the newline, unless the input line is longer than the specified size of the buffer. Then there is no \n, only the \0 terminator. A more robust program would check the input string for \n, and add one if needed. It would not hurt to increase IN_BUF_SIZE.
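
One possible shape for the newline check is sketched here. `ensure_newline()` is a hypothetical helper, and treating a full buffer as a truncated line is an assumption of this sketch rather than something the answer's program does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: make sure the line stored in buf (capacity `cap`
 * bytes) ends with '\n'. Returns 1 if a newline is already present or was
 * appended, 0 if there is no room for one, which suggests that fgets()
 * filled the buffer and the line was truncated. */
int ensure_newline(char *buf, size_t cap)
{
    size_t len = strlen(buf);

    if (len > 0 && buf[len - 1] == '\n')
        return 1;                   /* complete line */
    if (len + 2 > cap)
        return 0;                   /* no room for '\n' plus '\0' */

    buf[len] = '\n';                /* append the missing newline */
    buf[len + 1] = '\0';
    return 1;
}
```

A caller could invoke it right after fgets() succeeds and report an error, or drain the rest of the line, when it returns 0.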

#include <stdio.h>
#include <stdlib.h>

#define IN_BUF_SIZE 256

int main(void)
{
    char inputStr[IN_BUF_SIZE];
    char command;
    char firstname[6];
    char surname[6];
    char *strPos = inputStr;        // next scan location
    int n = 0;                      // holds number of characters read

    if (fgets(inputStr, sizeof inputStr, stdin) == NULL) {
        fprintf(stderr, "Error in fgets()\n");
        exit(EXIT_FAILURE);
    }

    if (sscanf(strPos, " %c%n%*[^ ]%n", &command, &n, &n) < 1) {
        fprintf(stderr, "Input formatting error: command\n");
        exit(EXIT_FAILURE);
    }

    strPos += n;
    if (sscanf(strPos, "%5s%n%*[^ ]%n", firstname, &n, &n) < 1) {
        fprintf(stderr, "Input formatting error: firstname\n");
        exit(EXIT_FAILURE);
    }

    strPos += n;
    if (sscanf(strPos, "%5s", surname) < 1) {
        fprintf(stderr, "Input formatting error: surname\n");
        exit(EXIT_FAILURE);
    }

    printf("%c %s %s\n", command, firstname, surname);
}

Sample interaction:

a Zaphod Beeblebrox
a Zapho Beebl
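
The double-%n technique can be isolated in a single function, sketched below. `scan_field5()` is a hypothetical name, and the scanset `[^ \n]` is a deliberate departure from the `[^ ]` used above: it also stops at a newline, so the trailing `\n` left by fgets() is not swallowed as part of the discarded characters.

```c
#include <stdio.h>

/* Hypothetical helper: read one word of at most 5 characters from s into
 * dst (at least 6 bytes), discard the rest of that word, and store the
 * number of characters consumed in *used. Returns 1 on success, 0 if no
 * word was found. */
int scan_field5(const char *s, char *dst, int *used)
{
    int n = 0;

    /* %5s       : store up to 5 characters (leading whitespace skipped)
     * first %n  : count so far, valid even if the next directive fails
     * %*[^ \n]  : discard leftover word characters without storing them
     * second %n : only reached when there was something to discard */
    if (sscanf(s, "%5s%n%*[^ \n]%n", dst, &n, &n) != 1)
        return 0;

    *used = n;
    return 1;
}
```

The return value is compared against 1 because neither the suppressed `%*[...]` conversion nor `%n` counts as an assignment, so a successful call reports exactly one.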

The scanf() family of functions has a reputation for being subtle and error-prone; the format strings used above may seem a little bit tricky. By writing a function to skip to the next word in the input string, the calls to sscanf() can be simplified. In the code below, skipToNext() takes a pointer to a string as input; if the first character of the string is a \0 terminator, the pointer is returned unchanged. All initial non-whitespace characters are skipped over, then any whitespace characters are skipped, up to the next non-whitespace character (which may be a \0). A pointer is returned to this non-whitespace character.

The resulting program is a little bit longer than the previous program, but it may be easier to understand, and it certainly has simpler format strings. This program does differ from the first in that it no longer accepts leading whitespace in the string. If the user enters whitespace before the command character, this is considered erroneous input.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define IN_BUF_SIZE 256

char * skipToNext(char *);

int main(void)
{
    char inputStr[IN_BUF_SIZE];
    char command;
    char firstname[6];
    char surname[6];
    char *strPos = inputStr;        // next scan location

    if (fgets(inputStr, sizeof inputStr, stdin) == NULL) {
        fprintf(stderr, "Error in fgets()\n");
        exit(EXIT_FAILURE);
    }

    if (sscanf(strPos, "%c", &command) != 1 || isspace((unsigned char)command)) {
        fprintf(stderr, "Input formatting error: command\n");
        exit(EXIT_FAILURE);
    }

    strPos = skipToNext(strPos);
    if (sscanf(strPos, "%5s", firstname) != 1) {
        fprintf(stderr, "Input formatting error: firstname\n");
        exit(EXIT_FAILURE);
    }

    strPos = skipToNext(strPos);
    if (sscanf(strPos, "%5s", surname) != 1) {
        fprintf(stderr, "Input formatting error: surname\n");
        exit(EXIT_FAILURE);
    }

    printf("%c %s %s\n", command, firstname, surname);
}

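char * skipToNext(char *c)
{
    /* skip the rest of the current word, stopping at the \0 terminator */
    while (*c != '\0' && !isspace((unsigned char)*c)) {
        ++c;
    }

    /* skip whitespace up to the next word; isspace('\0') is false */
    while (isspace((unsigned char)*c)) {
        ++c;
    }

    return c;
}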
ad absurdum
  • Just a couple of notes, `fgets(inputStr, sizeof inputStr, ...` only works in the scope where `inputstr` is declared, you have a constant, better `fgets(inputStr, IN_BUF_SIZE, ...`. While syntactically fine, `sscanf(strPos, " %c%n%*[^ ]%n", &command, &n, &n)` -- there is only one `n`, I'm not sure the intent of the validation value of assigning it twice. Just a nit, `strPos` -> `strpos`, leave `camelCase` to C++, nothing wrong with it, just style -- so it is up to you. – David C. Rankin Mar 30 '17 at 04:28
  • @DavidC.Rankin-- Agree on `camelCase`, but OP used it, so I did too for consistency. I tend to prefer `sizeof ...` to a constant to protect from future changes in declarations; but that is a good note, and I will add a comment in my answer. Maybe I wasn't clear enough about this in my answer, but the assignment to `n` twice is because the second `sscanf()` match fails if the first match was the right size, i.e., a character followed by a space, or a string of 5 characters. In this case, the first assignment gives `n` the correct character count. If the scan reaches the end, `n` is overwritten. – ad absurdum Mar 30 '17 at 05:00
  • Yes, I was pretty sure that was the case, but I always try to flag it for the OP because believe it or not, those little style issues speak loudly as the first impression your code makes. As for `n`, yes, if the `return` is `1` I guess you have your first `n` to check, but then that leaves you knowing your next char is `' '` or `EOF`. – David C. Rankin Mar 30 '17 at 05:37
  • @DavidC.Rankin-- In the first program, `sscanf()` will return `1` if one or both matches are made since the second is suppressed, and `%n` doesn't increment the assignment count. If the second match fails, and there is no preceding `%n`, then nothing is assigned to `n` and the next scan position is unknown. If the second match is successful and there is no following `%n`, the wrong value is stored in `n`. Hence, both assignments are necessary. – ad absurdum Mar 30 '17 at 13:27