5

For many years, gets has been universally disparaged as being an unsafe function. (The canonical SO question is Why is the gets function so dangerous that it should not be used?). The gets function is so bad that it has been removed from the C11 language standard. Supporters of gets (there are few if any) would argue that it is perfectly fine to use it if you know about the structure of the input.

Why do people who disparage gets and acknowledge the folly of relying on the structure of the input allow the usage of %d as a scanf conversion specifier? That's a sociological question, and the real question is: why is %d in a scanf format string unsafe?

William Pursell
  • 204,365
  • 48
  • 270
  • 300
  • 1
    In fact console I/O and stdio.h as whole is what's problematic. Programs without a GUI ought to take input either from command line arguments and/or files. And from there, sanitize the input. If anyone is still developing console I/O applications using a dysfunctional library from 1970 in a professional/commercial context around the year 2022, they ought to seriously step back and consider what the f they are doing. – Lundin Dec 22 '22 at 13:49
  • Great question by the way, after I finally read it correctly! Probably because of a commonly used phrase _"...as bad as it gets."_ I was seeing the word _it_, where it wasn't, thus changing the whole meaning of the title in my mind. Once I caught my mistake I considered editing to make _gets_ look like a function, but evidently no one else had the problem :) I blame [this](https://www.dictionary.com/e/typoglycemia/), or [this](https://indiabioscience.org/news/2020/a-new-study-explains-how-the-human-brain-recognizes-jumbled-words). – ryyker Dec 22 '22 at 15:00
  • 1
    I had the same issue; it turns out there were backticks but someone edited them out, now they are back in. – Mustafa Aydın Dec 22 '22 at 15:02
  • 1
    @MustafaAydın - glad to know I'm not the only one :) – ryyker Dec 22 '22 at 15:03
  • 1
    A clickbaity title. You can always get a memory corruption out of `gets` on an unsanitized input, while `scanf("%d", ...)` will at worst give you an unspecified integer (I know it's UB in theory, but probably not as bad in practice). – HolyBlackCat Dec 22 '22 at 15:06
  • This seems relevant: [**Disadvantages of scanf**](https://stackoverflow.com/questions/2430303/disadvantages-of-scanf) – Andrew Henle Dec 22 '22 at 15:22
  • William Pursell, I see 2 main issues here: Reading a _line_ and parsing data. Discussion could work those together or separately. IMO use of `_Generic` would help with handling parsing woes: [exmaple](https://codereview.stackexchange.com/questions/115143/formatted-print-without-the-need-to-specify-type-matching-specifiers-using-gene). – chux - Reinstate Monica Dec 22 '22 at 15:34
  • The potential issues with `scanf` are legion, but there is no equivalence or similarity between `scanf("%d",...)` and `gets()` so why are you comparing them? `scanf("%s[^\n]%*c",line);` perhaps might be a fairer comparison and is equally dangerous. The upshot however is that `scanf()` can be used both safely and unsafely where `gets()` is never safe. You only know the structure of the _intended_ input, not what some malicious or incompetent user might enter or redirect into it. – Clifford Dec 22 '22 at 16:49
  • @Clifford I disagree. Although it is possible to use `scanf` safely, it is never possible to use `scanf("%d", ...)` safely. Both `scanf("%d", ...)` and `gets` lead to undefined behavior in common circumstances. @SteveSummit makes a very good point that the UB incurred in the former is in practice less severe than the latter, but the two are certainly comparable. One leads to UB when there are too many digits in the input stream, the other leads to UB when there are too many non-newlines in the input stream. Very similar in that regard. – William Pursell Dec 22 '22 at 17:02
  • Undefined behaviour however in this context would (reasonably) mean that you do not know what value the receiving `int` takes, rather then it will overrun the bounds of that `int`. Undefined does not mean unreasonable, and is not the same as _implementation defined_. It comes about through inaction rather than specific implementation (in the real world). It is still a chalk and cheese comparison.; they do not serve the same function. – Clifford Dec 22 '22 at 17:28
  • @Clifford It's a rather delicate situation. Sometimes (and, it seems, more often lately) what a compiler chooses to do with undefined behavior *does* seem unreasonable. And to pick and choose — to say that you can live with certain kinds of undefined behavior, because you believe the consequences shouldn't be "too bad" — is a potentially perilous game. (And yet, it's precisely the same game I'm playing here: I *do* believe that `scanf("%d", …)` and `atoi("123xyz")` are vastly less dangerous than `gets()` and `i++ + i++`.) – Steve Summit Dec 26 '22 at 17:24
  • See, for example, [part two](http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html) of ["What Every C Programmer Should Know About Undefined Behavior"](http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html). – Steve Summit Dec 26 '22 at 17:36

4 Answers4

4

If the format string to scanf contains a raw %d conversion specifier ("raw" meaning "without a maximum field width"), the behavior is undefined if the input stream contains a string that is a valid representation of an integer that cannot fit in an int. eg, the string 5294967296 cannot be represented in an int on a platform where sizeof(int) == 4. The C language only guarantees that an int be large enough to hold the range -32767 thru +32767, so any input stream that contains the string 32768 could potentially lead to undefined behavior. This potential undefined behavior can be avoided by using %4d. Most modern platforms have a value of INT_MAX that is much larger than 32767, so realistically the width-modifier on the conversion specifier can be larger than 4, but it ought to be determined (either at compile time or at run time) for the platform, and it must be present in the format string.

If you don't add a width modifier, you might as well just use gets to read a line into a buffer and use sscanf to parse the values. This will (perhaps) make the error more obvious to the reader.

William Pursell
  • 204,365
  • 48
  • 270
  • 300
  • 12
    I acknowledge the issue you're describing. I deny that in practice it is anywhere near as problematic as use of `gets()`. – John Bollinger Dec 22 '22 at 13:44
  • 5
    Or as bad as `"%s"` with no width. – Retired Ninja Dec 22 '22 at 13:48
  • 6
    I feel like this answer is begging for supercat to jump in with remarks about C implementations not arbitrarily choosing to do senseless things when UB occurs. – John Bollinger Dec 22 '22 at 13:49
  • While the maximum field width mitigates the problem the `scanf` is still vulnerable to INT_MAX+1 etc, as long as the number string fits in, say, 10 digits for a 32 bit int, isn't it? Newer versions of the standard should really make this at least implementation defined, or simply define it. – Peter - Reinstate Monica Dec 22 '22 at 13:57
  • By the way, the gets/sscanf suggestion doesn't make a lot of sense, does it? Scanning from a string or a file is not substantially different, in particular equally vulnerable to overflows. What is needed is a parsing library with defined overflow behavior. In all reality, any robust input processing of untrusted input must do low-level parsing and error handling/re-syncing, even if that is painful and error-prone. – Peter - Reinstate Monica Dec 22 '22 at 14:04
  • @Peter-ReinstateMonica Indeed, suggesting `gets/sscanf` was intended to be a bit ridiculous. I primarily advocate that no one ever use either `scanf` or `sscanf` at all, and merely posed this question to have something to link to when answering questions about `scanf`. The whole `*scanf` family ought to be condemned to the dustbin of history. – William Pursell Dec 22 '22 at 14:13
  • @supercat - You've been called out. :) (See 3rd comment.) – ryyker Dec 22 '22 at 15:12
  • @JohnBollinger *I feel like this answer is begging for supercat to jump in with remarks about C implementations not arbitrarily choosing to do senseless things when UB occurs.* But in this case, the problem of UB is real. Even if the behavior is not arbitrary, the input stream will be left in an unknown state that's well-nigh impossible to recover from reliably, especially if it's not seekable. – Andrew Henle Dec 22 '22 at 15:27
  • Do you have a citation for out-of-range integer being Undefined, rather than merely Unspecified? – Toby Speight Dec 22 '22 at 15:51
  • 1
    @TobySpeight — [§7.21.6.2 The `fscanf` function ¶10](http://port70.net/~nsz/c/c11/n1570.html#7.21.6.2p10): _[…] If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined._ – Jonathan Leffler Dec 22 '22 at 16:19
  • Thank you @Jonathan. I thought I'd read that paragraph, but I obviously went too quickly! – Toby Speight Dec 22 '22 at 16:32
4

No, scanf("%d", …) is not as bad as gets.

gets is as bad as it gets because it is not possible to use it safely, in virtually any environment. Buffer overflow is likely, cannot be prevented, and is quite likely to lead to arbitrarily bad consequences.

The worst thing that can happen with scanf("%d", …), on the other hand, is integer overflow. While this is theoretically also undefined behavior, in practice it virtually always results in either (a) quiet wraparound, (b) overflow to INT_MAX or INT_MIN, or (c) a runtime exception which may terminate the calling program.

It is extremely difficult to imagine a scenario under which an attacker could exploit a program using scanf("%d", …). Exploits involving gets, on the other hand, are commonplace.

(Although not the question asked, it's true that scanf("%s", …) is precisely as dangerous as gets. It's a fair question why the former isn't always disparaged as strenuously as the latter.)

Steve Summit
  • 45,437
  • 7
  • 70
  • 103
  • I've been debating changing the title to "Why is scanf("%d", ...) unsafe?", but this answer would then be less relevant, and that is unfortunate because you bring up a great point. UB is UB, but (in practice) some UB is worse. – William Pursell Dec 22 '22 at 17:09
  • I've removed "why" from the title, which makes your first paragraph a bit obsolete but I think retains the heart of the discussion. I really appreciate your answer; you always have good insights. – William Pursell Dec 22 '22 at 17:12
  • @WilliamPursell Thanks for the heads-up. I adjusted my first paragraph. – Steve Summit Dec 22 '22 at 17:51
2

As well known, former gets() offers no control/detection of buffer overflow leading to UB. It could have had it had a size parameter.

In addition to @William Pursel good answer concerning int range.

scanf("%d", ...): Input not limited to one line.

gets() read 1 line. "%d" in scanf(), first consumes leading white-space which may include several lines.

scanf("%d", ...): does not read the whole line.

Unlike gets(), scanf("%d", ...) leaves any input after the input for the int. This often includes a '\n'. Not reading the entire lines often sets the seed for subsequent problems.

Depending on goals, scanf("%d", ...) does not complain about trailing non-numeric text.


C lacks a robust ways to read a line. IMO, fgets(), gets_s(), scanf(anything), extension getline() all lack some functionality.

I'd campaign for a int scan_line(size_t sz, char *buf /*, size_t *length_read*/) that always reads a line, always forms a string in buf and returns EOF (end-of-file, input error), 1 on success and 0 when sz is too small.


Alternatively (and more debatable) *scanf() could be improved:

  • Add ability to pass in size for "%s" and friends. This is sorely needed.

  • Defined behavior on int overflow.

  • Something like "%#\n" to scan in white-space, but not '\n'. Does not contribute to the return value.

  • Something like "%\n" to scan in 1 '\n'. Contributes to the return value. May use a leading space "% \n" to allow optional leading non-'\n' white-space.

  • Offer *scanfln() which always read just 1 line.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • 1
    You can pass a size for `scanf` `"%s"` arguments already (e.g., `"%8s"`, or using `scanf_s()`, which allows/forces it to be an additional argument), can already make it read and discard specific whitespace (`"%*[ \t]"`) and similarly read only a single newline (`"%1[\n]"`). The glibc implementation can also set `errno` to `ERANGE` for failed integer conversions. – Hasturkun Dec 22 '22 at 15:56
  • 1
    @Hasturkun: I agree with you — but I think @chux wants something more akin to the `printf()` notation using `*` to indicate that the length is provided by an argument to the function, rather than requiring the size to be coded in the format string. An equivalent to the POSIX-compliant `%ms` notation that dynamically allocates the data for the string — preferably with a different notation (since `%s` expects a `char *` while `%ms` expects a `char **`) would be beneficial too. In an ideal world, the `*` would be used for the same purpose in `scanf()`, but history has pre-empted this option. – Jonathan Leffler Dec 22 '22 at 16:25
  • Maybe the `@` character (or some otherwise unused character) could be used in both `printf()` and `scanf()` families for "the length comes from an argument". The `*` notation in `printf()` could be deprecated (marked obsolescent), but would continue to be supported indefinitely. The `*` in `scanf()` would still mean "no assignment". – Jonathan Leffler Dec 22 '22 at 16:29
  • I think for secure processing of unknown input you'll always need a custom parser; perhaps a regex library would do it, if the input is line oriented. (And writing a safe, convenient wrapper around fgets() that reliably reads a line shouldn't be too hard.) It would be difficult to harden scanf against numeric overflow: There is only 1 ungetc(), so that invariably input is lost when a number is too large. – Peter - Reinstate Monica Dec 22 '22 at 16:35
  • @Hasturkun Using a variable _width_ obliges reforming the format - which is error prone with its `-1` need and prevents compile time checking for other specifiers.. `scanf_s()` introduces constraint handling and is not consistent between MS vs. the C spec concerning the type of the size. – chux - Reinstate Monica Dec 22 '22 at 18:55
  • @Hasturkun `"%*[ \t]"` only scans some non-`'\n'` white-spaces. The set of white-spaces also has a _locale_ dependency. `"%1[\n]"` is not too bad to consume a `'\n'`, yet can be expensive for weak libraries in forming the _scanset_. – chux - Reinstate Monica Dec 22 '22 at 18:58
0

gets doesn't have any means to prevent buffer overflow errors.

For scanf("%d", &x); there is no way to make buffer overflow error (it type matches format string).

Now in case of

char s[5];
scanf("%s", s); 

There is a danger of buffer overflow (when user types work with more then 4 characters), but it is easy to fix this code to protect from buffer overflow:

char s[5];
scanf("%4s", s); 

Now this version can't buffer overflow.

Note that scanf is relay bug-prone, so prevent common mistake threat warning related to format string as an error.

Basically gets there was no way to protect from invalid (to long) user input. Also ther si not way to fix it without breaking binary or source compatibility.
In case of scanf more advanced format string can protect you form buffer overflow and this can be enforced by static analysis tools.

Marek R
  • 32,568
  • 6
  • 55
  • 140