In the Python 2 REPL:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'

My understanding is that when I enter the statement below on stdin,

>>> stringLiteral = 'abc'

the interpreter reads the statement from stdin as UTF-8-encoded bytes and executes it.

But I have learnt that in Python 2 the str type stores 'abc' as a byte string, and that CPython stores it internally as a null-terminated C char * string (i.e. an array of bytes terminated by \0).

What encoding scheme does the str class use to store 'abc' in memory? What decoding scheme is used when 'abc' is printed?

Given that, if I enter the statement below:

>>> stringLiteralNonAsciiRange = 'abc정정💛'

then why does evaluating stringLiteralNonAsciiRange not show abc정정💛? Why is the output 'abc\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'?
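
For reference, the byte sequence in that output can be compared with the UTF-8 encoding of the characters that were typed. This is only a minimal sketch: it assumes the terminal sends UTF-8 (as `sys.stdin.encoding` suggests), and the names `text`, `raw` and `expected` are made up for illustration. It uses `u''` and `b''` literals so that it runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# The typed characters as Unicode text: 'abc', two Hangul syllables
# (U+C815 twice) and one emoji (U+1F49B).
text = u'abc\uc815\uc815\U0001f49b'

# Assumption: the terminal encodes what is typed as UTF-8.
raw = text.encode('utf-8')

# The bytes the REPL displayed for stringLiteralNonAsciiRange.
expected = b'abc\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'

print(raw == expected)  # do the encoded bytes match the displayed bytes?
print(len(raw))         # 3 ASCII bytes + 2 * 3 bytes + 4 bytes for the emoji
```

If the comparison holds, the escaped output is simply the UTF-8 bytes of the input, shown byte by byte.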

cs95
overexchange
  • Python 2 interprets string literals as `ASCII` bytes. `sys.stdin.encoding` is irrelevant, since a literal is not taken from `stdin` – juanpa.arrivillaga Jun 06 '17 at 20:01
  • 1. ASCII by default, unless you specify unicode (the `u` prepended to the string will be an indicator). 2. Try `print repr(stringLiteralNonAsciiRange)`. – cs95 Jun 06 '17 at 20:02
  • I'll take "why python 3 is better" for 400, Alex. ;) – erip Jun 06 '17 at 20:03
  • In other words, Python 2 `str` == Python 3 `bytes`. – juanpa.arrivillaga Jun 06 '17 at 20:03
  • Just typing the name of a variable is NOT the same as printing it - it's more like `print repr(variable)`. The `repr` of a string uses escape sequences for all non-ASCII and non-printable characters, so that you can see exactly what's in the string. – jasonharper Jun 06 '17 at 20:04
  • @erip My understanding is that the in-memory representation of Python 3's `bytes` type should be similar to that of Python 2's `str` type – overexchange Jun 06 '17 at 20:04
  • Sure. You'll need to decode those bytes as unicode. – erip Jun 06 '17 at 20:07
  • You'll also note `'\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'.decode('utf8')` gives `u'\uc815\uc815\U0001f49b'` which *is* `"정정💛"` – juanpa.arrivillaga Jun 06 '17 at 20:08
  • @juanpa.arrivillaga Why does `'abc정정💛'.decode('utf-8')` give `u'abc\uc815\uc815\U0001f49b'` and not `abc정정💛`? That is still not clear to me. – overexchange Jun 06 '17 at 20:13
  • There's a [difference](https://stackoverflow.com/questions/1436703/difference-between-str-and-repr-in-python) between an object's `__repr__` method and an object's `__str__` method. – erip Jun 06 '17 at 20:14
  • @erip OK. So the `decode()` output in my previous comment has nothing to do with encoding/decoding. Thank you – overexchange Jun 06 '17 at 20:17
  • @Shiva Is the ASCII decoder used in both cases, `print repr(stringLiteral)` and `print repr(stringLiteralNonAsciiRange)`, given that they are byte strings and nothing more? – overexchange Jun 06 '17 at 20:26
  • Yes, in the latter case, ASCII doesn't recognise those characters, so it is printed as is (`abc\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b`). – cs95 Jun 06 '17 at 20:32
  • @Shiva If ASCII were the encoding scheme used to store `stringLiteralNonAsciiRange`, then `stringLiteralNonAsciiRange.decode('ascii')` should not fail. But it does fail, which contradicts that. – overexchange Jun 06 '17 at 22:14
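
The points raised in the comments above (decoding as ASCII fails, decoding as UTF-8 succeeds, and `repr` escapes non-ASCII bytes) can be checked with a short sketch. It is illustrative only: the names `raw`, `text` and `ascii_ok` are hypothetical, and a `b''` literal stands in for the Python 2 `str` so that the same code runs on Python 2.7 and Python 3.

```python
from __future__ import print_function

# The bytes held by stringLiteralNonAsciiRange (a Python 2 str, Python 3 bytes).
raw = b'abc\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'

# A byte string carries no encoding of its own; decoding needs one supplied.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    # Bytes >= 0x80 are outside the ASCII range, so the ASCII codec rejects them.
    ascii_ok = False

# Decoding with the encoding that produced the bytes recovers the text.
text = raw.decode('utf-8')

print(ascii_ok)
print(text == u'abc\uc815\uc815\U0001f49b')

# repr() shows non-ASCII bytes as \x escapes, which is what the REPL echoes.
print(repr(raw))
```

On this reading, the escaped form is just how `repr` displays the stored bytes; it does not by itself show which encoding produced them.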

0 Answers