Also: I believe the first output is of type bytes, but what is the type of the second output? My guess is str with UTF-8 encoding.
Close, but not quite right. In Python 3 the str type is indexed by Unicode code points (note that code points usually, but not always, have a 1:1 correspondence with user-perceived characters). Therefore, the underlying encoding is abstracted away when using the str type; consider it unencoded, even though that is fundamentally not the case. It is the bytes type that is indexed as a simple array of bytes and therefore must use a particular encoding. In this case (as in most similar usages), ASCII would suffice for decoding what was generated by the subprocess script.
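To make the indexing difference concrete, here is a minimal sketch. The sample string is an assumption chosen only because it contains one non-ASCII character; it is not taken from the question.

```python
# str is indexed by code points; bytes is indexed by raw byte values.
text = "n\u00e9"               # two code points: 'n' and U+00E9 (e with acute)
data = text.encode("utf-8")    # b'n\xc3\xa9': U+00E9 takes two bytes in UTF-8

print(len(text))   # 2
print(len(data))   # 3
print(data[1])     # 195 (0xC3): indexing bytes yields an int, not a character
```

The same text has a different length depending on which type holds it, which is why bytes always implies some encoding while str does not expose one.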
Python 2 has different defaults for the interpretation of the str type (see here): there, str is a byte string and unicode is the separate text type. String literals are therefore represented differently in that version of the language, and this difference can be a big stumbling block when researching text handling.
As a person who mostly uses C++, I found the following to be extremely enlightening about the practical storage, encoding, and indexing of Unicode text: How do I use 3 and 4-byte Unicode characters with standard C++ strings?
So the answer to the first part of the question is bytes.decode():
a = a.decode('ascii') ## convert from `bytes` to 'str' type
although simply using
a = a.decode() ## assumes UTF-8 encoding
will typically produce the same results since ASCII is a subset of UTF-8.
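A small sketch of why "typically" is the right word, assuming a made-up byte string in place of the real subprocess output: the two calls agree on pure ASCII data and differ only when a non-ASCII byte appears.

```python
raw = b"hello\n"                  # hypothetical ASCII-only output

as_ascii = raw.decode("ascii")
as_utf8 = raw.decode()            # the default encoding is 'utf-8'
print(as_ascii == as_utf8)        # True

# With a non-ASCII character the ASCII codec refuses the data instead.
non_ascii = "caf\u00e9".encode("utf-8")
ascii_error = None
try:
    non_ascii.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc
    print("ascii decode failed")

print(len(non_ascii.decode()))    # 4
```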
Alternatively, you can use str() like so:
a = str(a,encoding='ascii')
but note that here an encoding must be specified if you want the "contents only" representation. Otherwise it will build a str that literally contains the quote characters and the 'b' prefix, which is exactly what was happening in the first output shown in the question.
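The difference is easy to see side by side. This is only an illustration with a made-up value, not the actual output from the question:

```python
raw = b"hello"                                # hypothetical bytes value

with_encoding = str(raw, encoding="ascii")    # decodes the bytes
without_encoding = str(raw)                   # builds the repr text instead

print(with_encoding)           # hello
print(without_encoding)        # b'hello'
print(len(with_encoding))      # 5
print(len(without_encoding))   # 8: the b and both quotes are real characters
```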
subprocess.check_output processes the data in binary mode (returning the raw byte sequence) by default, but the cryptic argument universal_newlines=True basically tells it to decode the output and represent it as text (using the str type). This conversion to the str type is necessary (in Python 3) if you want to display the output (and "only the contents") using Python's print function.
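As a rough sketch of both modes, the snippet below runs a stand-in child process; the command is an assumption (a one-line Python program), not the script from the question. From Python 3.7 onward, text=True is accepted as a more readable alias for universal_newlines=True.

```python
import subprocess
import sys

# Stand-in child process: prints one ASCII line.
cmd = [sys.executable, "-c", "print('hello')"]

raw = subprocess.check_output(cmd)                            # binary mode
text = subprocess.check_output(cmd, universal_newlines=True)  # text mode

print(type(raw).__name__)    # bytes
print(type(text).__name__)   # str
print(repr(text))            # 'hello\n'
print(text, end="")          # hello
```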
The funny thing about this conversion is that, for these purposes, it really doesn't do anything to the data. What happens under the hood is an implementation detail, but if the data is ASCII (as is very typical for this type of program) it essentially just gets copied from one place to another without any meaningful translation. The decode operation is just hoop jumping to change the data type -- and the seemingly pointless nature of the operation further obfuscates the larger vision behind Python's text handling (for the uninitiated). Further, since the docs don't make the return type(s) explicit (by name), it's hard to even know where to start looking for the appropriate conversion function.