I have the following code snippet:

#!/usr/bin/env python3

print(float(b'5'))

This prints 5.0 with no error (on Linux with UTF-8 encoding). I'm very surprised that it doesn't raise an error, since Python is not supposed to know which encoding the bytes object uses.

Any insight?

static_rtti
  • Have you read the [documentation](https://docs.python.org/3/howto/unicode.html#encodings)? See also https://docs.python.org/3.6/c-api/buffer.html#bufferobjects – Mazdak May 18 '18 at 10:07
  • @Kasramvd: the documentation for `float()` states it accepts a `str`, a number, or a type that implements `__float__`. `bytes` doesn't implement `__float__`. – Martijn Pieters May 18 '18 at 10:13
  • @MartijnPieters [Here](https://docs.python.org/3/library/functions.html#float) it's mentioned that if the argument is a string, it should contain a decimal number, optionally preceded by a sign, and optionally embedded in whitespace. Doesn't `b'5'` follow that rule? Either way, this should have been specified clearly in the documentation. – Mazdak May 18 '18 at 10:17
  • Fair question, since [not all encodings are supersets of ASCII](https://stackoverflow.com/q/6531750/4014959). – PM 2Ring May 18 '18 at 10:17
  • @Kasramvd: no, it doesn't. The `bytes` type is not considered a string. – Martijn Pieters May 18 '18 at 10:24
  • @MartijnPieters Indeed. I mean that since bytes represent a sequence of characters and can also contain decimals, they should have been mentioned as well; as you said, it's a bug in the documentation. – Mazdak May 18 '18 at 10:26

1 Answer

When passed a bytes object, float() treats the contents of the object as ASCII bytes. That's sufficient here, because the conversion from string to float only accepts ASCII digits and letters, the sign characters + and -, plus . and _ anyway. For str input, the only non-ASCII codepoints that are permitted are whitespace codepoints and Unicode decimal digits, and both are mapped to ASCII before parsing. This is analogous to the way int() treats bytes input.
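
A quick sketch of that behaviour, assuming CPython 3.6 or later (the underscore case relies on the digit-grouping syntax added in 3.6). The helper name `try_float` is made up for this illustration and is not part of any library:

```python
def try_float(value):
    """Hypothetical helper: return float(value), or an error string if parsing fails."""
    try:
        return float(value)
    except ValueError as exc:
        return f"ValueError: {exc}"


# ASCII-only bytes are parsed just like the equivalent str.
print(try_float(b"5"))
print(try_float(b" -1_000.5 "))  # sign, underscore grouping, surrounding whitespace
print(try_float(b"1e3"))         # the letter 'e' introduces an exponent

# int() applies the same ASCII interpretation to bytes.
print(int(b"42"))

# Bytes outside the ASCII range are rejected; these are the UTF-8 bytes of '½'.
print(try_float("½".encode("utf-8")))
```

No decoding step is involved: the bytes are accepted only if they already look like an ASCII float literal.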

Under the hood, the implementation does this:

  • because the input is not a string, PyNumber_Float() is called on the object (for str objects the code jumps straight to PyFloat_FromString).
  • PyNumber_Float() checks for a __float__ method, but if that's not available, it calls PyFloat_FromString()
  • PyFloat_FromString() accepts not only str objects, but any object implementing the buffer protocol. The String name is a Python 2 holdover; the Python 3 str type is called Unicode in the C implementation.
  • bytes objects implement the buffer protocol, and the PyBytes_AS_STRING macro is used to access the internal C buffer holding the bytes.
  • A combination of two internal functions named _Py_string_to_number_with_underscores() and float_from_string_inner() is then used to parse ASCII bytes into a floating point value.
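
The steps above can be observed from Python without reading the C source. The following is only a sketch of the visible behaviour, assuming CPython 3; the `Celsius` class is a hypothetical example of a type that provides `__float__`:

```python
class Celsius:
    """Hypothetical type with __float__, so float() never has to parse text."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __float__(self):
        return float(self.degrees)


results = {
    "str": float("2.5"),                      # handled directly as text
    "__float__": float(Celsius(21)),          # the __float__ method is used
    "bytes": float(b"2.5"),                   # no __float__, parsed as ASCII
    "bytearray": float(bytearray(b"2.5")),    # another buffer-protocol type
    "memoryview": float(memoryview(b"2.5")),  # generic buffer object
}
for kind, value in results.items():
    print(f"{kind:>10}: {value}")

# An object with neither __float__ nor a buffer cannot be converted.
try:
    float([b"2.5"])
except TypeError as exc:
    print("TypeError:", exc)
```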

For actual str objects, the CPython implementation converts any non-ASCII string into a sequence of ASCII bytes: ASCII codepoints are kept as they are, Unicode decimal digits are mapped to the corresponding ASCII digits, and any non-ASCII whitespace character is replaced with an ASCII 0x20 space. The result is then passed to the same _Py_string_to_number_with_underscores() / float_from_string_inner() combination.
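
A small sketch of how this differs between str and bytes, assuming CPython 3. The sample values are chosen for illustration only. The `float()` documentation defines a digit as any character in Unicode category Nd, which is why the fullwidth digits below parse as a str but not as UTF-8 bytes:

```python
samples = {
    "non-ASCII whitespace": "\u00a03.5\u2003",  # NO-BREAK SPACE ... EM SPACE
    "fullwidth digits": "\uff13.\uff15",        # FULLWIDTH DIGIT THREE and FIVE
}

outcomes = {}
for label, text in samples.items():
    as_str = float(text)  # str input: non-ASCII spaces and digits are accepted
    try:
        as_bytes = float(text.encode("utf-8"))
    except ValueError:
        as_bytes = None   # bytes input: the same text is rejected
    outcomes[label] = (as_str, as_bytes)
    print(f"{label}: str -> {as_str}, utf-8 bytes -> {as_bytes}")
```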

I see this as a bug in the documentation and have filed an issue with the Python project to have it updated.

Martijn Pieters
  • I know there won't be a thing about Python that this guy doesn't know. – Sraw May 18 '18 at 10:26
  • Thanks for the great answer. So, just to be clear, this will fail with certain encodings, such as UTF-16? – static_rtti May 18 '18 at 11:37
  • @static_rtti: absolutely, because the `\x00` bytes won't be accepted. The bytes **must** be ASCII only, and fit the `float()` string interpretation rules. – Martijn Pieters May 18 '18 at 11:39