
Take for example `rc = scanf("%f", &flt);` with the input `42ex`. An implementation of `scanf` would read `42e`, expecting a digit or sign to follow, and only realize on reading `x` that none does. Should it at this point push back both `x` and `e`, or only the `x`?

The reason I ask is that GNU's libc will, on a subsequent call to `gets`, return `ex`, indicating that it has pushed back both `x` and `e`, but the standard says:

An input item is read from the stream, unless the specification includes an `n` specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.[245] The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.

I interpret this as follows: `42e` is a prefix of a matching input sequence (for example, `42e1` would be a matching input sequence), so `42e` should be treated as an input item that is read, leaving only `x` unread. That would also be more convenient to implement if the stream only supports single-character pushback.
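One way to see what a particular C library does with this input is a small probe. The following is only a sketch under stated assumptions: `probe_result` and `probe_float` are hypothetical names, and it uses `sscanf` with `%n` on a string rather than `scanf` on a stream, so a library may well behave differently on a real stream.

```c
#include <stdio.h>

/* Hypothetical helper type: what one "%f" conversion reported. */
typedef struct {
    int rc;        /* return value of sscanf */
    float value;   /* converted value, meaningful only if rc == 1 */
    int consumed;  /* characters consumed, meaningful only if rc == 1 */
} probe_result;

/* Hypothetical helper: apply "%f%n" to `input` and report the outcome.
 * %n is only processed if the %f conversion succeeded, so `consumed`
 * stays 0 after a matching failure. */
probe_result probe_float(const char *input)
{
    probe_result r = { 0, 0.0f, 0 };
    int n = 0;

    r.rc = sscanf(input, "%f%n", &r.value, &n);
    if (r.rc == 1)
        r.consumed = n;
    return r;
}
```

Calling `probe_float("42ex")` and inspecting the result distinguishes the cases: `rc == 0` corresponds to the reading of the standard given above (a matching failure), while `rc == 1` with `value == 42` means the library accepted a shorter item, and `consumed` then shows whether the `e` was left unread.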

skyking
  • I think the part following the 'e' will be incomplete, causing `scanf` to backtrack and only recognize `42`. – Paul Ogilvie Nov 03 '16 at 15:30
  • @PaulOgilvie: That's possible, but the standard is fairly clear that there should be no backtracking. The `x` should simply not be accepted. – MSalters Nov 03 '16 at 15:36
  • Your quote does not say anything about push back in case of an error. The input stream is broken after an error. Does it make much sense to muse about the kind of corruption? – ceving Nov 03 '16 at 15:38
  • @ceving That's said in the footnote [245], there it also says that it pushes back at most one character. Also note that the implementation is bound to peek at at least one character beyond the input item. – skyking Nov 03 '16 at 15:43
  • @Ceving, I don't think there will be any corruption of the input stream. `scanf` uses the buffers for its scans. With 'backtracking' I mean the algorithm's internal working; compare it to yacc shift/reduce. – Paul Ogilvie Nov 03 '16 at 15:44
  • Other hypothesis: There's a state transition missing. The "normal" format is `42.0e03` and the `e` in `42e` isn't yet expected. Instead, the parser is still in state "integral part" looking for `[0-9.]`, and the transition _to_ state "exponent" is only from state "fractional part". – MSalters Nov 03 '16 at 15:46
  • Look at the format as `42[.0][e03]` the `e03` must have been completely parsed before it is recognized. – Paul Ogilvie Nov 03 '16 at 15:55
  • Certainly it is implementation-defined. A stream is obliged to be able to push back 1 character, the `x` of `"42ex"`. Yet it needs to push `e` back too. Some implementations do it, others do not. `sscanf()` almost _always_ pushes both back IIRC. `stdin` is not required to be buffered aside from 1 character. Avoid evil `scanf()`. – chux - Reinstate Monica Nov 03 '16 at 16:18
  • A slightly worse case, consider `"42e-x"`. Need to push back 3. – chux - Reinstate Monica Nov 03 '16 at 16:20
  • Even worse: `INFINITx` – nwellnhof Nov 03 '16 at 16:56
  • I asked the same question a while ago http://stackoverflow.com/questions/26334399/what-is-the-result-of-strtod3ex-end-supposed-to-be-what-about-sscanf – AnT stands with Russia Nov 03 '16 at 17:48
  • @Paul Ogilvie: Historically, the capability of `scanf` to "return" characters back to the stream was based on the capabilities of `ungetc`. And the latter does not guarantee that it can return more than 1 character. This is why traditional implementations of `scanf` do not backtrack. – AnT stands with Russia Nov 03 '16 at 17:51
  • The worst case is of course the NaNs: there you can have `NAN(` followed by alphanumerics, but then not finished off with a closing parenthesis. If `scanf` is to leave the stream positioned to read the opening parenthesis, we could theoretically have a case where we would have to push back enough characters to fill both available primary and secondary storage. – skyking Nov 07 '16 at 06:00
  • @skyking This may explain why the GNU C Library does not handle NaN inputs correctly, e.g. `"nan()"`: [bug 30647](https://sourceware.org/bugzilla/show_bug.cgi?id=30647). – vinc17 Aug 16 '23 at 15:00
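The single-character pushback model discussed in the comments can be sketched as follows. This is a hypothetical illustration, not code from any real libc: `scan_decimal_float` is an invented name, it handles only decimal input (no hexadecimal floats, `INF` or `NAN`), and it treats its small internal buffer like a field width. It reads the longest prefix of a matching sequence, calls `ungetc` at most once, and reports a matching failure when the item it read is only a prefix.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch. Returns 1 and stores the value in *out on a match,
 * 0 on a matching failure, EOF if end-of-file is hit before any item. */
int scan_decimal_float(FILE *in, double *out)
{
    char buf[64];
    const size_t cap = sizeof buf - 1;  /* leave room for '\0' */
    size_t len = 0;
    int mant_digits = 0;                /* digits seen in the mantissa */
    int complete;                       /* is buf a full matching sequence? */
    int c;

    do {
        c = getc(in);
    } while (isspace(c));               /* skip leading white space */
    if (c == EOF)
        return EOF;

    if ((c == '+' || c == '-') && len < cap) {   /* optional sign */
        buf[len++] = (char)c;
        c = getc(in);
    }
    while (isdigit(c) && len < cap) {            /* integer part */
        buf[len++] = (char)c;
        mant_digits++;
        c = getc(in);
    }
    if (c == '.' && len < cap) {                 /* optional fraction */
        buf[len++] = (char)c;
        c = getc(in);
        while (isdigit(c) && len < cap) {
            buf[len++] = (char)c;
            mant_digits++;
            c = getc(in);
        }
    }
    complete = mant_digits > 0;

    if (complete && (c == 'e' || c == 'E') && len < cap) {
        buf[len++] = (char)c;
        complete = 0;                   /* "42e" is only a prefix so far */
        c = getc(in);
        if ((c == '+' || c == '-') && len < cap) {
            buf[len++] = (char)c;
            c = getc(in);
        }
        while (isdigit(c) && len < cap) {
            buf[len++] = (char)c;
            complete = 1;               /* at least one exponent digit */
            c = getc(in);
        }
    }

    if (c != EOF)
        ungetc(c, in);                  /* the only pushback in this function */

    if (!complete)
        return 0;                       /* item was consumed but does not match */
    buf[len] = '\0';
    *out = strtod(buf, NULL);
    return 1;
}
```

With the input `42ex`, this sketch consumes `42e`, pushes back only `x`, and returns 0, which is the behaviour the question reads out of the standard. Accepting `42` instead would require un-reading `e` as well, that is, more pushback than `ungetc` guarantees.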

1 Answer


Your interpretation of the standard is correct. There's even an example further down in the C standard which says that `100ergs of energy` shouldn't match `%f%20s of %20s` because `100e` fails to match `%f`.

But most C libraries seem to implement this differently, probably due to historical reasons. I just checked the C library on macOS and it behaves like glibc. The corresponding glibc bug was closed as WONTFIX with the following explanation from Ulrich Drepper:

This is stupidity on the ISO C committee side which goes against existing practice. Any change can break existing code.
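The standard's example mentioned above can be tried against any C library with a few lines of code. This is a minimal sketch with some assumptions: `struct measurement` and `parse_measurement` are hypothetical names, and `sscanf` is used on a string where the standard's example reads from `stdin` with `fscanf`, so stream behaviour may differ.

```c
#include <stdio.h>

/* Hypothetical helper type: result of applying the example format. */
struct measurement {
    int count;       /* return value of sscanf */
    float quant;
    char units[21];  /* %20s needs room for 20 characters plus '\0' */
    char item[21];
};

/* Hypothetical helper: apply the format from the standard's example
 * to one line of input passed as a string. */
struct measurement parse_measurement(const char *line)
{
    struct measurement m = { 0, 0.0f, "", "" };

    m.count = sscanf(line, "%f%20s of %20s", &m.quant, m.units, m.item);
    return m;
}
```

According to the standard's example, `parse_measurement("100ergs of energy").count` should be 0 because `100e` fails to match `%f`. A library that behaves as described in this answer would be expected to report successful conversions instead.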

nwellnhof