7

NOTE: I've seen the post What is the cin analougus of scanf formatted input? before asking this question, and it doesn't solve my problem. That post asks for the C++ way to do it, but as I explain here, using only the C++ way is sometimes inconvenient, and I have clear examples of that.

I am trying to read data from an istream object, and sometimes it is inconvenient to use only C++-style mechanisms such as operator>>. For example, the data may be in a special form such as 123:456, so you have to imbue a locale that treats ':' as whitespace (which is very hacky compared with %d:%d in scanf), or 00123, which you want to read as a string and convert as decimal rather than octal (as opposed to %d in scanf). There are possibly many other cases.

I chose istream as the interface because it can be derived from and is therefore more flexible. For example, we can create in-memory streams, or customized streams that are generated on the fly. C-style FILE*, on the other hand, is very limited when it comes to creating customized streams, at least in a standard-compliant way.

So my question is: is there a way to do scanf-like data extraction on an istream object? I think fscanf internally reads character by character from the FILE* using fgetc, and istream provides a similar interface. So it would be possible to copy the code of fscanf and replace the FILE* with the istream object, but that's very hacky. Is there a smarter and cleaner way, or is there some existing work on this?

Thanks.

Kan Li
  • Possibly related [What is the cin analougus of scanf formatted input?](http://stackoverflow.com/questions/14330637/what-is-the-cin-analougus-of-scanf-formatted-input) – Scis Jun 19 '14 at 08:39
  • @Scis: Only very loosely. It's certainly not a duplicate. – Lightness Races in Orbit Jun 19 '14 at 09:47
  • @LightnessRacesinOrbit That's why I did not flag it as such :) Just suggested it could possibly be useful\relevant in some way (even if loosely). – Scis Jun 19 '14 at 13:02
  • It appears that `fmemopen` and `open_memstream` are in POSIX.1-2008. However, `fopencookie`, which would give you a complete replacement of `istream`, is a GNU extension. Pity. – tmyklebu Jun 19 '14 at 18:42
  • @tmyklebu, it is very interesting to learn about fopencookie, but I've already decided to use istream. Also, I didn't mention in my original post that I use templates and rely on overloads of operator>> to handle most cases; scanf on istream is only for some corner cases. So thanks for the pointer to fopencookie, but I am still looking for fscanf on istream. – Kan Li Jun 19 '14 at 19:12
  • @icando: Well, you chose to use `istream` and now you're paying the price. :) – tmyklebu Jun 19 '14 at 19:15
  • @tmyklebu, I was thinking I might be able to wrap the `istream` as a `FILE*` with `fopencookie`. However, I realized that `fscanf` can read ahead and put characters back into the `FILE*`, while `fopencookie` doesn't support intercepting `ungetc` calls. See my related question: http://stackoverflow.com/questions/24318525/what-happens-if-ungetc-is-called-on-a-file-created-by-fopencookie – Kan Li Jun 20 '14 at 01:28

3 Answers

10

You should never, under any circumstances, use scanf or its relatives for anything, for three reasons:

  1. Many format strings, including for instance all the simple uses of %s, are just as dangerous as gets.
  2. It is almost impossible to recover from malformed input, because scanf does not tell you how far in characters into the input it got when it hit something unexpected.
  3. Numeric overflow triggers undefined behavior: yes, that means scanf is allowed to crash the entire program if a numeric field in the input has too many digits.

Prior to C++11, the C++ specification defined istream formatted input of numbers in terms of scanf, which means that last objection is very likely to apply to them as well! (In C++11 the specification is changed to use strto* instead and to do something predictable if that detects overflow.)
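As a rough illustration of the C++11 behaviour described above, the following sketch extracts an int from text that is too large to represent. The helper name try_extract_int is hypothetical, and the comments assume a standard library that conforms to C++11 in this regard; older libraries may behave differently.

```cpp
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: tries to extract one int from `text`.
// Returns true if the extraction succeeded. `out` receives whatever
// the stream stored, which on overflow under C++11 rules should be
// the largest (or smallest) representable value.
bool try_extract_int(const std::string& text, int& out) {
    std::istringstream in(text);
    in >> out;
    return !in.fail();
}

// Convenience check: did the extraction clamp to the maximum int?
bool clamped_to_max(int value) {
    return value == std::numeric_limits<int>::max();
}
```

Calling try_extract_int("99999999999999999999", v) on such a library is expected to return false, with failbit set on the stream and v clamped, rather than invoking undefined behavior.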

What you should do instead is: read entire lines of input into std::string objects with getline, hand-code logic to split them up into fields (I don't remember off the top of my head what the C++-string equivalent of strsep is, but I'm sure it exists) and then convert numeric strings to machine numbers with the strtol/strtod family of functions.

I cannot emphasize this enough: THE ONLY 100% RELIABLE WAY TO CONVERT STRINGS TO NUMBERS IN C OR C++, unless you are lucky enough to have a C++ runtime that is already C++11-conformant in this regard, IS WITH THE strto* FUNCTIONS, and you must use them correctly:

errno = 0;
result = strtoX(s, &ends, 10); // omit 10 for floats
if (s == ends || *ends || errno)
  parse_error();

(The OpenBSD manpages, linked above, explain why you have to do this fairly convoluted thing.)

(If you're clever, you can use ends and some manual logic to skip that colon, instead of strsep.)
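A minimal sketch of that idea, applied to the 123:456 input from the question, might look like the following. The function name parse_pair is an assumption made for illustration, and the sketch uses strtol with base 10 so that leading zeros (as in 00123) are read as decimal.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: parses text of the form "<int>:<int>" in base 10.
// Returns true only if both numbers convert cleanly and nothing follows.
bool parse_pair(const std::string& line, long& first, long& second) {
    const char* s = line.c_str();
    char* ends = nullptr;

    errno = 0;
    first = std::strtol(s, &ends, 10);
    if (s == ends || errno != 0 || *ends != ':')
        return false;               // no digits, overflow, or missing colon

    s = ends + 1;                   // use `ends` to step over the colon
    errno = 0;
    second = std::strtol(s, &ends, 10);
    if (s == ends || errno != 0 || *ends != '\0')
        return false;               // no digits, overflow, or trailing junk

    return true;
}
```

In use, each line would first be read with std::getline into a std::string and then passed to parse_pair. Note that strtol itself skips leading whitespace and accepts a sign, so this sketch is slightly more permissive than a strict digits-only parser.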

zwol
  • The C++ numeric extractors store the maximum value and set `failbit` on integer overflow. N3936 §22.4.2.1.2/3.3. – Potatoswatter Jun 20 '14 at 01:27
  • `istream` numeric extraction is specified in terms of `std::num_get`, which is not specified in terms of `scanf`, although the standard does briefly mention scanf-like conversion specifiers they are for exposition, not normative specification. The actual conversion is done by `strto*` as you recommend. I'm retracting the upvote for now… – Potatoswatter Jun 20 '14 at 01:33
  • @Potatoswatter Oh, cool, this is fixed in C++14! I'll have to look tomorrow (when I'm on my work computer, which has the PDF, this one doesn't) and see if that was already in C++11. In -98, however, the conversion was as-if by `scanf`, *not* `strto*`. – zwol Jun 20 '14 at 01:36
  • I checked; it was fixed in C++11. C++98 and C++03 specified that `num_get` must return values and errors that `scanf` would, but they avoid implying that `scanf` should or must be called. – Potatoswatter Jun 20 '14 at 01:41
  • Is (3) considered a defect in the C standard? It seems like the sort of freakish and pointlessly dangerous thing that can't even be used to get a better score on the SPEC benchmarks. And not really the sort of thing worth working around. – tmyklebu Jun 20 '14 at 18:14
  • @tmyklebu I do consider it a defect, but fixing it would be nontrivial and since I consider `scanf` not fit for purpose anyway, I can't be bothered to write up a DR myself. The offending sentence is at the end of C99 7.19.6.2p10: "If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined." This was probably left undefined because *library implementations* didn't agree; compilers at the time of C89 didn't optimize on the assumption UB never happens, so it was much less problematic than it is now. – zwol Jun 20 '14 at 19:13
  • @Potatoswatter You're right that it doesn't say it has to *call* `scanf`, but what it (C++98 22.2.2.1.1p11) does say is "A sequence of `char`s has been accumulated in stage 2 that is converted (*according to the rules of `scanf`*) to a value of the type of `val`.)" Emphasis mine; "according to the rules of `scanf`" means that *all* of the behavior specified by C for `scanf` (specifically C89, since this is C++98) applies to the behavior of `num_get`; in particular, the behavior of `num_get` is undefined when the behavior of `scanf` is undefined. – zwol Jun 20 '14 at 19:22
  • @Zack: Yeah, it kinda seems like a careless mistake. "You used the wrong conversion specifier" should certainly be UB but "your `short` is too short to hold `-32769`" prooobably shouldn't result in nasal demons. – tmyklebu Jun 20 '14 at 19:23
  • @Potatoswatter And my reading of C++11 agrees with yours; I have edited the answer accordingly. – zwol Jun 20 '14 at 19:26
  • @tmyklebu I used to think it was just sloppy wording, but actually I bet if you tested 1989-era C libraries, you'd find that some of them wrote garbage to the out-parameter, some of them treated it as a conversion failure (so `scanf` returns early and doesn't touch the value), and some of them behaved as `strtol` does (return the largest representable value with the appropriate sign, set `errno`) and the committee decided to punt. Unfortunately, there is no distinction in the standard between this kind of UB and the kind that the compiler is allowed to assume never happens. – zwol Jun 20 '14 at 19:29
  • @Zack: Oh eek; that's far worse than I thought. – tmyklebu Jun 20 '14 at 19:31
  • Yes, "according to the rules of `scanf`" means that UB is still UB, but it also means that the C++ library is allowed to re-implement and fix `scanf` bugs. For example, a C++11-capable library used in C++98 mode will almost certainly not call `scanf`. Also, note that stage 2 strips the separators and "accumulates" a new string before converting by those rules. So using `scanf` directly was never an alternative, at most a C++ library would re-buffer the input and call `sscanf`. – Potatoswatter Jun 21 '14 at 00:47
  • @Potatoswatter Sure, the library implementors are free to do that, but the *application* programmer cannot assume that they have, and the *compiler* implementors (pre-C++11) are free to delete numeric-overflow recovery code on the assumption it never happens. – zwol Jun 21 '14 at 12:59
  • It would be worth at least a passing mention of reliable mechanisms for parsing input according to rules defined using format strings, e.g. Boost::Qi and various parser-lexer-generators. – Ben Voigt Apr 27 '15 at 14:52
  • @BenVoigt I am not personally familiar with any of those except the old-fashioned `lex`/`yacc` combo, which is awkward to use from C++ even with GNU Bison's extensions to make it less awkward. If you were to write another answer discussing that point I'd upvote it ;-) – zwol Apr 27 '15 at 16:34
2

I do not recommend mixing C++ I/O and C I/O. Not that they are really incompatible, but they may simply interoperate incorrectly.

For example, the Oracle docs recommend against mixing them: http://www.oracle.com/technetwork/articles/servers-storage-dev/mixingcandcpluspluscode-305840.html

But nothing stops you from reading data into a buffer and parsing it with standard C functions like sscanf.

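...
std::string curString;
int a, b;
...

// Read one line from the stream, then parse the buffered text with sscanf.
std::getline(inputStream, curString);

int sscanfResult = sscanf(curString.c_str(), "%d:%d", &a, &b);

// sscanf returns the number of fields it converted successfully.
if (2 != sscanfResult)
   throw "error";
...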

But it won't help when your stream is just one long contiguous sequence of symbols (like a string turned into a memory stream).

Writing your own fscanf from scratch, or porting(?) the original CRT function, actually isn't the worst possible idea. Just make sure you test it thoroughly (low-level custom char manipulation has always been a source of pain in C).

I've never really tried Boost.Spirit, and such parsing infrastructure could be overkill for your project. But Boost libraries are usually well tested and well designed, so you could at least try it.
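A full fscanf port is a large job, but a much smaller hand-rolled piece can cover the literal-matching part of a format such as %d:%d. The sketch below is only one possible approach, not a reimplementation of fscanf; the names expect_char and read_pair are hypothetical and were chosen for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: matches one literal character on the stream,
// similar to a literal character in a scanf format string.
struct expect_char {
    char c;
};

std::istream& operator>>(std::istream& in, expect_char e) {
    using traits = std::istream::traits_type;
    if (traits::eq_int_type(in.peek(), traits::to_int_type(e.c)))
        in.get();                              // consume the matching character
    else
        in.setstate(std::ios_base::failbit);   // mismatch, or stream already bad
    return in;
}

// Reads "<int>:<int>" directly from the stream, roughly like "%d:%d".
// With the default (decimal) stream flags, 00123 is read as 123.
bool read_pair(std::istream& in, int& a, int& b) {
    return static_cast<bool>(in >> a >> expect_char{':'} >> b);
}
```

Because this works on the istream itself, it does not depend on where lines end: calling read_pair repeatedly on a stream containing "123:456\n7:8" should yield both pairs, since operator>> for int skips leading whitespace by default.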

Eugene Podskal
  • The problem with sscanf is that when the return value is less than the number of items to parse, you can't tell whether there was a parsing error or the underlying string was used up (in which case you'd have to read more from the inputStream and try again). – Kan Li Jun 19 '14 at 17:24
  • I mentioned that ('when your stream is just one long...'), just maybe not in the best way. – Eugene Podskal Jun 19 '14 at 18:45
  • What you mentioned is definitely a good example. But a more common case is that `scanf` accepts input spanning multiple lines. For example, if the input is `char buf[] = "123 456\n789"`, then `sscanf(buf, "%d%d%d", &a, &b, &c)` will work, but `getline` will miss the third one. – Kan Li Jun 19 '14 at 18:50
  • The main problem with an arbitrarily long stream is when getline will return. If your stream is one endless line, this method won't be of any help to you. – Eugene Podskal Jun 19 '14 at 18:58
0

Based on @tmyklebu's comment, I implemented streamScanf which wraps istream as FILE* via fopencookie: https://github.com/likan999/codejam/blob/master/Common/StreamScanf.cpp

Kan Li