
For once I thought I found a good use for sscanf() but after reading about how it handles integers, it appears not. Having a string that should look like this: 123,456,678 I thought I could safely and concisely parse it with this code:

unsigned int x[3];
if( sscanf( s, "%u,%u,%u", x+0, x+1, x+2 ) == 3 )
    …

If conversion fails I'm not really interested in knowing why, nor am I worried about getting incorrect data. If there's something other than numbers in there, scanf() should surely create a matching error and abort, and it knows I'm looking for an unsigned integer, so anything negative should also be a matching error? Nope.

I got suspicious when I read about the conversion specifier %u: Matches an optionally signed decimal integer. Why would this not be a matching error? What happens if it is signed?

Quoting from ISO/IEC 9899:201x 7.21.6.2 ¶ 10, The fscanf function (emphasis mine):

Except in the case of a % specifier, the input item (or, in the case of a %n directive, the count of input characters) is converted to a type appropriate to the conversion specifier. If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. Unless assignment suppression was indicated by a *, the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.

It appears to read as if scanf() treats every integer-looking conversion specifier the same, reads the input as some kind of signed integer of unspecified size, and then writes to the output bypassing all normal conversions.

For example converting any integer (negative or positive) into an unsigned integer of smaller size is well behaved according to normal implicit conversions, but not with scanf():

unsigned int x;
x = -1;                   /* Well defined: (-1) + (UINT_MAX+1) = UINT_MAX */
sscanf( "-1", "%u", &x ); /* Undefined behavior? */

Please tell me I'm wrong and that I have missed some part of the standard. One thing that I can't really find a reference to is this part of the section quoted above: "the input item (…) is converted to a type appropriate to the conversion specifier". If the conversion specifier is %u then anything negative is of course not appropriate, nor is anything that does not fit into an unsigned integer. However, I could not find anything in the standard telling me exactly what an "appropriate type" is.

I found a handful of questions dealing with this directly or indirectly, but not in much detail. The question most similar to mine is C: How to prevent input using scanf from overflowing? but it's framed in a way that's not as specific. A few answers (1, 2) mention the issue but offer no detail or references.

The goal of my question is to get an answer detailing exactly why this cannot be interpreted in any way other than undefined behavior, and preferably some rationale as to why this makes sense, fully knowing that some things in C are inconsistent and I have to accept it.

pipe
  • Yes, it can fail if the origin string `s` is NULL or not appropriately terminated. – wildplasser Feb 24 '21 at 00:51
  • See also [scanf %u negative number?](https://stackoverflow.com/questions/38684386/scanf-u-negative-number). – dxiv Feb 24 '21 at 00:52
  • How do you figure `scanf` “writes to the output *bypassing* all normal conversions”? The standard says it does a conversion. Conversions are specified in C 2018 6.3. For `%u`, the appropriate type is `unsigned`. So matched input of `-1` will result in −1 being converted to `unsigned`. Conversion of −1 to `unsigned` will yield `UINT_MAX`. – Eric Postpischil Feb 24 '21 at 01:31
  • A problem here is that it never specifies how the input sequence is converted to the destination type. E.g. it doesn't say that this happens as if by `strtoul` . – M.M Feb 24 '21 at 01:32
  • Re “… I could not find anything in the standard telling me exactly what an "appropriate type" is”: Is the type that is supposed to be passed (by address) for the conversion. For `%u`, it is `unsigned`. – Eric Postpischil Feb 24 '21 at 01:33
  • @EricPostpischil I don't agree that's implied. Other parts of the specification of `scanf` contradict the "as if by `strtoul`" approach, e.g. the predicate "if the result of the conversion cannot be represented in the object" can never be satisfied, because the results of `strtoul` and family are always representable in the result type (with that result perhaps being `ULONG_MAX` with errno set to `ERANGE`, for example). Also, what about `%llu` with input `ULONG_MAX+1` (as string)? If converted as if by `strtoul` the result should be `0`. – M.M Feb 24 '21 at 02:47
  • We *could* come up with some reasonable theory about how it should behave, but the fact is that the standard doesn't actually specify whatever theory we settle on. It is just a reasonable interpolation of an underspecification. – M.M Feb 24 '21 at 02:48
  • @M.M: The predicate is satisfiable for `%f`. – Eric Postpischil Feb 24 '21 at 03:10
  • I take this extreme comment thread as a sign that my question is somewhat valid - there's at least _arguably_ some confusion and unclarity. Hope I have time to digest it before it's cleared out... – pipe Feb 24 '21 at 05:07
  • http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2483.pdf – cpplearner Feb 24 '21 at 10:29
  • @M.M: Upon further study, I think I am wrong about a number-to-number conversion being involved. I think the standard intends a direct conversion from the numeral in the string to the final type. (With either interpretation, `scanf` may have undefined behavior given a sufficiently large input, either because it is too big for the `unsigned` destination or because it is too big for the widest integer the implementation supports.) I will delete my inapplicable comments. – Eric Postpischil Feb 25 '21 at 23:54
  • @EricPostpischil we're back to square one with the behaviour on input `-1` then , since this is out of range for `unsigned int` and it's not clear whether the intent is that out-of-range input is undefined behaviour, or whether the intent is to behave like `strtoul` . – M.M Feb 26 '21 at 01:10

1 Answer


For once I thought I found a good use for sscanf() but after reading about how it handles integers, it appears not

As for the advice to avoid a tool in C because it could be dangerous: I've seen this used a lot as a weapon against scanf, goto and even traditional C strings... but ultimately the entire language presents subtle hazards, so you'd be better off following advice along the lines of (correctly) use the right tool for the job, and C is mostly not the right tool for most jobs! Keep this in mind; sometimes that cargo cult thinking will blind you to the best tools. Also, on the note of correctness, I'm sure you're aware that you ought to be checking the return values of most standard library functions, and it's from these kinds of common omissions that such cargo cult thinking arises. To use a tool correctly, we must read its manual, and the fscanf manual makes it pretty clear how important the return value is. It's so refreshing to see people reading such manuals (thanks for asking such questions).
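To make the point about return values concrete, here is a minimal sketch that wraps the call from the question. The helper name `scan_triplet` is made up for illustration; the format string is the one from the question.

```c
#include <stdio.h>

/* Hypothetical helper: forwards the return value of sscanf unchanged so
 * the caller can tell full success, partial success and failure apart. */
static int scan_triplet(const char *s, unsigned out[3])
{
    /* Returns the number of items assigned (0 to 3), or EOF if the
     * input ends before the first conversion completes. */
    return sscanf(s, "%u,%u,%u", &out[0], &out[1], &out[2]);
}
```

With this helper, `scan_triplet("123,456,678", x)` returns 3, `scan_triplet("12,x,3", x)` returns 1 because the second directive hits a matching failure, `scan_triplet("abc", x)` returns 0, and `scan_triplet("", x)` returns `EOF`. Only a result of 3 means all three objects were written.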

As far as your questions go, I've itemised those that I think you want answers to, and will come back to those. However, first off, it seems you've started from a few inaccurate premises. You may have glossed over some necessary details in section 7.21.6.2, paragraph 9 (literally the paragraph before p10, which you've quoted), for example, so it's hard to say whether your understanding of the term "input item" is correct:

An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.
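As a small illustration of that definition, the sketch below uses a field width to cap the first input item at three characters. The helper name `scan_width_limited` and the format string are assumptions chosen for the example, not something from the question.

```c
#include <stdio.h>

/* Hypothetical helper: the first directive may consume at most three
 * characters, the second takes the longest run of digits that follows,
 * and %n records how many characters were consumed in total. */
static int scan_width_limited(const char *s, unsigned *head,
                              unsigned *tail, int *consumed)
{
    *consumed = 0; /* left at 0 if scanning stops before %n is reached */
    return sscanf(s, "%3u%u%n", head, tail, consumed);
}
```

For the input "12345" the first input item is "123" (capped by the width) and the second is "45", so the call returns 2 with `*consumed` set to 5. For "12abc" the first item is "12", the second directive fails to match at 'a', and the call returns 1.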

So in fact your later question:

What happens if it is signed?

... is essentially the same as:

What happens when the (sequence of characters) input item begins with a '-' character?

I can't say for sure what will happen, as your implementation has many options available to comply, and it seems to be down to the standard library. There are several places in the standard where optimisation is alluded to, such as the "as if" rule in section 5.1.2.3p4 and p6. The reason to leave such details to the implementation is to allow it chances to optimise that wouldn't otherwise have been possible. Suffice it to say, a conversion will happen. In this answer I'll give one way that a standard library could comply with this requirement (a conversion), but rest assured that's just one possibility; there are many others, and your compiler might even replace this code with something more optimal (a different conversion).

There are details in other sections that describe signed-to-unsigned conversions such as in section 6.3.1.3p2, without undefined behaviour:

Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
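The rule can also be spelled out by hand and compared against the built-in conversion. This is only a sketch: `convert_by_rule` is a hypothetical name, and the code assumes `long long` is wider than `unsigned int`, which holds on common platforms but is not guaranteed by the standard.

```c
#include <limits.h>

/* Hypothetical helper: applies the "add or subtract one more than the
 * maximum value" rule using a remainder instead of a loop.
 * Assumption: long long can represent UINT_MAX + 1. */
static unsigned convert_by_rule(long long value)
{
    const long long modulus = (long long)UINT_MAX + 1;
    long long reduced = value % modulus; /* may be negative in C */

    if (reduced < 0)
        reduced += modulus;              /* now in the range 0..UINT_MAX */
    return (unsigned)reduced;            /* value already fits, no change */
}
```

Under that assumption, `convert_by_rule(-1)` yields `UINT_MAX`, which is the same value as the plain conversion `(unsigned)-1`.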

It's to be expected that when the input begins with a negative sign, scanf-related functions either perform explicit conversions along that line of logic, or use one of the operators (as described in 6.3) that provide that conversion. For example, in your standard library that might look something like:

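/* Fragment: needs <stdio.h> and <ctype.h>; `file` is the input stream. */
int c = fgetc(file);
unsigned u = 0;
switch (c) {
    case '-':
    { int d = 0;
      while (isdigit(c = fgetc(file)))
      { d *= 10;            /* overflow checks omitted for brevity */
        d -= (c - '0');     /* accumulate the negative value */
      }
      if (c != EOF) ungetc(c, file);
      u = d; // here's your signed-to-unsigned conversion, with no UB
      break;
    }
    default:
      while (isdigit(c))
      { u *= 10;
        u += c - '0';       /* digit value, not the character code */
        c = fgetc(file);
      }
      if (c != EOF) ungetc(c, file);
}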

Now that we've shown how the standard library could comply in this case, on to the other question (your first one):

Why would this not be a matching error?

With a keen eye you may notice that my code could be less repetitive. If I had to hazard a guess, the designers wanted to modularise code so as to reduce L1 cache thrashing (since that was once much more of a problem than it is nowadays), so they devised clever patterns to match all kinds of numeric data with the same logic. You could ask the same question about the pp-number element, and the answer would be the same: C would probably be much slower in practice if it didn't have an "as if" rule...
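As a rough idea of what shared matching logic could look like, here is a sketch that folds the two branches of the earlier fragment into one loop over a string. The name `convert_unsigned_item` is hypothetical, and wrapping on out-of-range input is a choice made for this example, not something the standard prescribes for scanf.

```c
#include <ctype.h>

/* Hypothetical helper: one digit loop serves both signed and unsigned
 * spellings. All arithmetic is unsigned, so it wraps instead of invoking
 * undefined behaviour (an assumption made for this sketch). */
static unsigned convert_unsigned_item(const char *item)
{
    int negative = 0;
    unsigned value = 0;

    if (*item == '+' || *item == '-')
        negative = (*item++ == '-');

    while (isdigit((unsigned char)*item))
        value = value * 10u + (unsigned)(*item++ - '0');

    /* Unsigned negation gives the same result as converting the
     * negative value to unsigned. */
    return negative ? 0u - value : value;
}
```

Here `convert_unsigned_item("-1")` gives the same value as `(unsigned)-1`, and `convert_unsigned_item("123")` gives 123.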

autistic