Why are leading zeros ignored on input?

Question

When you write int i; cin >> i; and input 00325 or 325, both inputs behave like i = 325;. But why? I think it is more natural if i's value was set to 0 when you input 00325, the next cin >> i; gave i = 0;, and the one after that gave i = 325;. I know 00325 could be an octal number, but here I am reading the number as decimal (I'm not using the oct manipulator).

What written evidence guarantees this behavior? I took a quick look at N4140 but couldn't find any evidence.

Note: I am not asking how to keep leading zeros on output, which is discussed in How can I pad an int with leading zeros when using cout << operator?.

My question will be solved as soon as someone gives me a sentence like

preceding zeros are ignored when setting a number-type object as the right operand of >>

in some reliable document.

Numbers 01 and 1 are equal in mathematics; 01 does not mean 0 and 1 there. If you want to keep the leading zeros, read the stream into a string. — 273K, Mar 22 '18 at 05:57
@S.M. Thank you for your comment. As I wrote in my question, what I want is written evidence that guarantees the behavior you described. I know the behavior as a fact, but I can't tell whether it is mandated by the standard. — ynn, Mar 22 '18 at 06:00
Why do you expect the C++ standard to violate mathematical rules? — 273K, Mar 22 '18 at 06:02
@S.M. No, I am not saying that C++ contains a contradiction. However, even if `01` and `1` are the same in mathematics, isn't it dangerous to believe that C++ follows mathematical rules with no evidence? Is this behavior truly trivial? I want the proof. — ynn, Mar 22 '18 at 06:04
A complete answer would need to quote several pages of both C++ and C standards, so is too long for this format. Working of `istream::operator>>()` for `int` is described in terms of classes and member functions in the localization (sic) library. Those are described as equivalent to using `%i` format in the C I/O library. In the C standards, the effect of `%i` is specified in terms of a function `wcstol()`. Behaviour of `wcstol()` (and related functions) is (after sign character, if present) all digits are read/interpreted. The effect is ignoring leading zeros (i.e. `01` equivalent to `1`). — Peter, Mar 22 '18 at 06:40
@Peter I suspect that it's defined in terms of the `%d` specifier (`%i` allows to specify `0x` for hexadecimal and `0` for octal). — Matteo Italia, Mar 22 '18 at 06:47
@MatteoItalia - curiously, the relevant table in the 1998 standard specifies `%i`. — Peter, Mar 22 '18 at 07:17
@Peter: I don't have the C++98 standard at hand, but in C++14 it's `%i` if the base field is 0, `%d` if it is `dec` (which is the default). — Matteo Italia, Mar 22 '18 at 07:43

Matteo Italia · Answer 1 · 2018-03-22T08:15:15.363

Warning: extremely boring answer ahead

C++14 [istream.formatting.arithmetic] ¶3

operator>>(int& val);

The conversion occurs as if performed by the following code fragment (using the same notation as for the preceding code fragment):

typedef num_get<charT,istreambuf_iterator<charT,traits> > numget;
iostate err = ios_base::goodbit;
long lval;
use_facet<numget>(loc).get(*this, 0, *this, err, lval);
if (lval < numeric_limits<int>::min()) {
    err |= ios_base::failbit;
    val = numeric_limits<int>::min();
} else if (numeric_limits<int>::max() < lval) {
    err |= ios_base::failbit;
    val = numeric_limits<int>::max();
} else
    val = static_cast<int>(lval);
setstate(err);

The grunt work here is done by num_get::get, which is specified at [facet.num.get.members] ¶1:

iter_type get(iter_type in, iter_type end, ios_base& str,
     ios_base::iostate& err, long& val) const;

[...] Returns: do_get(in, end, str, err, val).

do_get in turn is defined immediately afterwards ([facet.num.get.virtuals]), which specifies in excruciating detail the exact workings of the whole shebang. I won't copy three pages' worth of pain, but just the main points.

In stage 1, an "equivalent stdio format specifier" is determined according to the stream flags, as per tables 85 and 86; a stream's default flags are dec | skipws, so we'll follow that path (which corresponds to %d). Also, some other locale- and flag-specific characters are determined for the next stage.

In stage 2, characters are read from the stream and accumulated in a buffer; the essential point for your question is that

If it is not discarded, then a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated

So the decision whether to keep reading your zeroes or stop after a single zero depends on the %d above; we'll get back to it.

In stage 3, the accumulated characters are finally converted to a long

by the rules of one of the functions declared in the header <cstdlib>:

For a signed integer value, the function strtoll.

Both the %d specifier and strtoll are defined in the C standard (C++14 refers to C99); let's dig them up.

At C99 §7.19.6.2 ¶12 (when talking about fscanf) it is stated that

d Matches an optionally signed decimal integer, whose format is the same as expected for the subject sequence of the strtol function with the value 10 for the base argument.

So it all boils down to strtol/strtoll, which we can find at C99 §7.20.1.4. It is specified that the longest sequence of whitespace is skipped, and then the "subject sequence" is considered:

If the value of base is zero, the expected form of the subject sequence is that of an integer constant as described in 6.4.4.1, optionally preceded by a plus or minus sign, but not including an integer suffix. If the value of base is between 2 and 36 (inclusive), the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign, but not including an integer suffix. The letters from a (or A) through z (or Z) are ascribed the values 10 through 35; only letters and digits whose ascribed values are less than that of base are permitted. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.

The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is empty or consists entirely of white-space, or if the first non-white-space character is other than a sign or a permissible letter or digit.

If the subject sequence has the expected form and the value of base is zero, the sequence of characters starting with the first digit is interpreted as an integer constant according to the rules of 6.4.4.1. If the subject sequence has the expected form and the value of base is between 2 and 36, it is used as the base for conversion, ascribing to each letter its value as given above. If the subject sequence begins with a minus sign, the value resulting from the conversion is negated (in the return type).

(ibidem, ¶3-5)

As you can see, there are no special provisions for leading zeroes; if it is a valid digit, it goes in the subject sequence, to be processed all in the same batch.

score 1 · Answer 2 · answered Mar 22 '18 at 06:44

The leading zeros are not ignored; they are accounted for when being processed, but the resulting value saved in a variable is the same as it would be for some other textual inputs.

The same "mathematical" number can have multiple textual representations, sometimes depending on locale. In Arabic numerals, 0325 is equal to +325, which is equal to 325. Depending on the chosen locale, the following texts may bear the same numerical value:

123,456.89
123.456,89
123456.89
123 456,89 (mind the half-space symbol as thousands separator)

Let's not forget about other writing systems for numbers, like Roman, Hebrew, Chinese, etc.

The problem of multiple number representations goes deeper than human tradition and occurs in pure mathematics, outside programming. It can be shown that a repeating decimal such as 12.9(9) [12.9999999…] is equal in every respect to 13.

I think it is more natural if i's value was set to 0

That would assume that the symbol "0" is treated differently from [1-9] in input parsing. But why?
