Here's my code:

#!/usr/bin/env python3
import subprocess

# check_output() returns the command's stdout as a bytes object.
a = subprocess.check_output('echo -n "hello world!"', shell=True)
print("a=" + str(a))

output:

a=b'hello world!'

If I include the argument universal_newlines=True in the call to check_output, then I get the desired output:

a=hello world!

For the sake of better understanding the mysterious world of programming with text in the modern (Unicode) age, I would like to know how to generate the second output without specifying universal_newlines=True. In other words, what function do I call to convert a so that it will produce the desired output?

A working example would go a long way. Detailed explanations are nice, but they tend to be a bit confusing for the uninitiated -- maybe due to the use of overloaded terminology, maybe because of differences between Python2 and Python3, or maybe just because I very rarely need to think about text encoding in my line of work -- most of the tools that I work with don't require special handling like this.

Also: I believe the first output is of type bytes, but what is the type of the second output? My guess is str with UTF-8 encoding.

Kevin
Brent Bradburn
  • Have you tried decoding the output? – Ignacio Vazquez-Abrams Feb 18 '15 at 17:51
  • @IgnacioVazquez-Abrams: Sure, I tried to figure that out, but my first few guesses didn't pan out. I'm hoping someone can show me how to do that. What is the syntax? What are the data types involved? etc. I'm sure this is very easy for people who already know how to do it. Hopefully, I will soon be one of those people. :) – Brent Bradburn Feb 18 '15 at 17:55
  • Now that I know what everything is called, I was able to find [the dup](http://stackoverflow.com/questions/606191/convert-bytes-to-a-python-string). Based on the number of hits that question has generated, I think it is fair to say that the documentation for the subprocess module could stand to provide a few more usage hints in order to be easier for the casual Python user. – Brent Bradburn Feb 20 '15 at 04:14

3 Answers

2

As originally implied by Ignacio's comment, you could use decode:

>>> a = b"hello world!"
>>> print("a="+str(a))
a=b'hello world!'
>>> print("a="+a.decode())
a=hello world!
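
Applied to the question's scenario, the same call should work directly on the value returned by check_output. This is only a minimal sketch, assuming the child process emits plain ASCII; `sys.executable` stands in for the `echo` command purely so that the example runs on any platform, and the names `command` and `raw` are illustrative:

```python
import subprocess
import sys

# Stand-in for `echo -n "hello world!"`: a child Python process that
# prints the same text without a trailing newline.
command = [sys.executable, "-c", "print('hello world!', end='')"]

raw = subprocess.check_output(command)  # bytes
a = raw.decode()                        # str, decoded as UTF-8 by default

print("a=" + a)
```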
Kevin
2

From subprocess.check_output() docs:

By default, this function will return the data as encoded bytes. The actual encoding of the output data may depend on the command being invoked, so the decoding to text will often need to be handled at the application level.

This behaviour may be overridden by setting universal_newlines to True as described below in Frequently Used Arguments.

If you follow the link to Frequently Used Arguments, it describes what universal_newlines=True does:

If universal_newlines is False the file objects stdin, stdout and stderr will be opened as binary streams, and no line ending conversion is done.

If universal_newlines is True, these file objects will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False). For stdin, line ending characters '\n' in the input will be converted to the default line separator os.linesep. For stdout and stderr, all line endings in the output will be converted to '\n'. For more information see the documentation of the io.TextIOWrapper class when the newline argument to its constructor is None.

For more details you could look at io.TextIOWrapper() documentation.
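
To make the quoted behaviour concrete, here is a small sketch that runs the same child process twice, once in the default binary mode and once with universal_newlines=True. The child command is a hypothetical stand-in (a Python one-liner that writes Windows-style line endings), chosen only so the newline conversion is visible:

```python
import subprocess
import sys

# Hypothetical child process that writes "\r\n" line endings straight
# to its binary stdout, bypassing any text-mode translation.
child = [
    sys.executable,
    "-c",
    r"import sys; sys.stdout.buffer.write(b'line1\r\nline2\r\n')",
]

raw = subprocess.check_output(child)                            # bytes, untouched
text = subprocess.check_output(child, universal_newlines=True)  # str, newlines normalized

print(type(raw).__name__, repr(raw))
print(type(text).__name__, repr(text))
```

In Python 3.7 and later, `text=True` is accepted as a more readable alias for `universal_newlines=True`.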

To run your echo -n "hello world!" shell command and to return text without check_output() and without using universal_newlines=True:

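#!/usr/bin/env python3
import locale
from subprocess import Popen, PIPE

# Use the same encoding that universal_newlines=True would use.
charset = locale.getpreferredencoding(False)
with Popen(['echo', 'hello world!'], stdout=PIPE) as process:
    # communicate() returns bytes; decode to str and strip the trailing newline.
    output = process.communicate()[0].decode(charset).strip()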

Here are a couple of code examples that show how subprocess pipes and the TextIOWrapper class can be used together.
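
One possible way to combine the two, offered as a sketch rather than a definitive recipe: wrap the binary pipe from Popen in an io.TextIOWrapper and read decoded lines from it. The child command is again a hypothetical Python one-liner used in place of `echo`, and the variable names are illustrative:

```python
import io
import locale
import subprocess
import sys

# Hypothetical child process standing in for the echo command.
command = [sys.executable, "-c", "print('hello world!')"]
charset = locale.getpreferredencoding(False)

with subprocess.Popen(command, stdout=subprocess.PIPE) as process:
    # process.stdout is a binary stream; the wrapper decodes it to str
    # and translates line endings to "\n" (newline=None is the default).
    with io.TextIOWrapper(process.stdout, encoding=charset) as text_stream:
        lines = [line.rstrip("\n") for line in text_stream]

print(lines)
```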

To understand what is text and what is binary data in Python, read the Unicode HOWTO. Here's the most important part: there are two major string types in Python: bytestrings (a sequence of bytes) that represent binary data and Unicode strings (a sequence of Unicode codepoints) that represent human-readable text. It is simple to convert one into another (☯):

unicode_text = bytestring.decode(character_encoding)
bytestring = unicode_text.encode(character_encoding)
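
As a quick illustration of that round trip, the sketch below encodes a short string containing two non-ASCII characters and decodes it back. The sample text and the choice of UTF-8 are assumptions made for the example:

```python
# "naïve ☯" written with escapes so the source file stays pure ASCII.
unicode_text = "na\u00efve \u262f"

bytestring = unicode_text.encode("utf-8")   # str -> bytes
round_tripped = bytestring.decode("utf-8")  # bytes -> str

# The str is counted in code points, the bytes object in bytes.
print(len(unicode_text), len(bytestring))
print(round_tripped == unicode_text)
```

The two lengths differ because UTF-8 spends more than one byte on each non-ASCII code point.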
Community
jfs
  • These docs were the first thing that I read (that's how I guessed to try `universal_newlines=True`). But, frankly, a lot of this was over my head initially because I didn't know how all of the terminology translated into data types (which I now know can be either `bytes` or `str`) and function calls (in particular it doesn't tell me that I should probably just call `decode()`, even though the data isn't really encoded). – Brent Bradburn Feb 19 '15 at 18:46
  • ...Also, the description (and name) of `universal_newlines` doesn't directly match anything in my prior experience. It seems to be describing what I would normally think of as text vs. binary I/O modes, but this is normally a NOP on Linux systems so I didn't find it too meaningful. What I don't see in these docs is clear documentation of the fact that `universal_newlines=True` changes the return type from `bytes` to `str`. ...But there's a lot of good data captured in this answer -- thanks! – Brent Bradburn Feb 19 '15 at 18:48
  • @nobar: I'm (almost) sure the subprocess docs don't tell you to call the `.strip()` method if you want to remove the leading/trailing whitespace from the output. The `subprocess` module uses the `bytes` and `str` types, but teaching about `bytes.decode()` is not its responsibility. The phrase *text streams* implies that the result is `str` (the last paragraph is common knowledge in Python -- you shouldn't expect `subprocess` to teach you that). Yes, `universal_newlines` does not suggest 'text mode' to me either -- it seems like a good compromise if you need to write single-source Python 2/3 compatible code. – jfs Feb 20 '15 at 00:48
0

Also: I believe the first output is of type bytes, but what is the type of the second output? My guess is str with UTF-8 encoding.

Close, but not quite right. In Python3 the str type is indexed by Unicode code points (note that code points usually, but not always, have a 1:1 correspondence with user-perceived characters). Therefore, the underlying encoding is abstracted away when using the str type -- consider it unencoded, even though that is fundamentally not the case. It is the bytes type which is indexed as a simple array of bytes and therefore must use a particular encoding. In this case (as in most similar usages), ASCII would suffice for decoding what was generated by the subprocess script.
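
The indexing difference can be seen in a few lines. This is just an illustrative sketch; the sample character and the use of unicodedata to show the "not always 1:1" case are my own choices:

```python
import unicodedata

text = "\u00e9"              # "e with acute accent" as a single code point
data = text.encode("utf-8")  # the same character as two UTF-8 bytes

print(len(text), len(data))  # str counts code points, bytes counts bytes
print(data[0])               # indexing bytes yields an int (here 0xC3)

# One user-perceived character can also be two code points:
decomposed = "e\u0301"       # "e" followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```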

Python2 has different defaults for the interpretation of the str type (see here), so string literals will be represented differently in that version of the language (this difference could be a big stumbling block when researching text handling).

As a person who mostly uses C++, I found the following to be extremely enlightening about the practical storage, encoding, and indexing of Unicode text: How do I use 3 and 4-byte Unicode characters with standard C++ strings?


So the answer to the first part of the question is bytes.decode():

a = a.decode('ascii')  # convert from bytes to str

although simply using

a = a.decode() ## assumes UTF-8 encoding

will typically produce the same results since ASCII is a subset of UTF-8.
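
A short sketch of where the two choices agree and where they part ways. The byte strings are made-up sample data:

```python
ascii_bytes = b"hello world!"
# For pure ASCII input, both codecs give the same str.
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# b"caf\xc3\xa9" is the UTF-8 encoding of "café".
utf8_bytes = b"caf\xc3\xa9"
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    # Bytes above 0x7F are not valid ASCII.
    ascii_failed = True

decoded = utf8_bytes.decode("utf-8")
print(same, ascii_failed, len(decoded))
```

So decode('ascii') is the stricter option: it raises an error as soon as the subprocess emits anything outside the ASCII range, whereas decode() accepts any valid UTF-8.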

Alternatively, you can use str() like so:

a = str(a,encoding='ascii')

but note that here an encoding must be specified if you want the "contents only" representation -- otherwise it will actually build a str type which internally contains the quote characters (including the 'b' prefix), which is exactly what was happening in the first output shown in the question.
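
The difference between the two forms of str() is easy to check in isolation. A minimal sketch, assuming the interpreter is run without the -b/-bb warning flags:

```python
a = b"hello world!"

as_repr = str(a)                    # no encoding: the repr of the bytes object
as_text = str(a, encoding="ascii")  # with encoding: equivalent to a.decode("ascii")

print(as_repr)
print(as_text)
print(len(as_repr), len(as_text))
```

The first string is three characters longer because the `b` prefix and the two quote characters are part of its contents.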


subprocess.check_output processes the data in binary mode (returning the raw byte sequence) by default, but the cryptic argument universal_newlines=True basically tells it to decode the output and represent it as text (using the str type). This conversion to the str type is necessary (in Python3) if you want to display the output (and "only the contents") using Python's print function.

The funny thing about this conversion is that, for these purposes, it really doesn't do anything to the data. What happens under the hood is an implementation detail, but if the data is ASCII (as is very typical for this type of program) it essentially just gets copied from one place to another without any meaningful translation. The decode operation is just hoop jumping to change the data type -- and the seemingly pointless nature of the operation further obfuscates the larger vision behind Python's text handling (for the uninitiated). Further, since the docs don't make the return type(s) explicit (by name), it's hard to even know where to start looking for the appropriate conversion function.
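
One way to see that nothing "happens" to ASCII data is to compare the numeric values on both sides of the conversion. This is a sketch of that observation only, not a statement about how CPython stores strings internally:

```python
data = b"hello world!"
text = data.decode("ascii")

byte_values = list(data)                # iterating bytes yields ints
code_points = [ord(ch) for ch in text]  # code point of each character

same_values = byte_values == code_points
print(type(data).__name__, type(text).__name__, same_values)
```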

Community
Brent Bradburn
  • Show the type of an object: `print("a=["+str(a)+"], type="+str(type(a)))` – Brent Bradburn Feb 19 '15 at 16:24
  • I probably would have had no problems, and never asked this question, if they had simply made `universal_newlines=True` the default. If your subprocess is returning non-ASCII Unicode (a rare case in my experience), then you would be happy to deal with this conversion process. If your subprocess is returning non-text binary, then a binary return mode would be good, but maybe it should be named as such. – Brent Bradburn Feb 19 '15 at 17:19
  • On the plus side, I can finally say that I have actually programmed with Unicode -- even though it was only used for the [ASCII subset](http://stackoverflow.com/questions/19212306/difference-between-ascii-and-unicode) and served no real purpose. – Brent Bradburn Feb 19 '15 at 17:22
  • Note to self: Think twice before trying to work in Python3. [It is a black hole of wasted time](http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/). Still makes a fine calculator though. – Brent Bradburn Jul 18 '15 at 01:46