6

How can I get subprocess.check_call to give me the raw binary output of a command? It seems to be encoding it incorrectly somewhere.

Details:

I have a command that returns text like this:

some output text “quote” ...

(Those quotes are Unicode curly quotes; in UTF-8 the left one is e2 80 9c and the right one is e2 80 9d.)

Here's how I'm calling the command:

import subprocess
from tempfile import SpooledTemporaryFile

f_output = SpooledTemporaryFile()
subprocess.check_call(cmd, shell=True, stdout=f_output)
f_output.seek(0)
output = f_output.read()

The problem is I get this:

>>> repr(output)
some output text ?quote? ...
>>> type(output)
<str>

(And if I call ord() on the '?', I get 63.) I'm on Python 2.7 on Linux.

Note: Running the same code on OS X works correctly for me. The problem only appears when I run it on a Linux server.

Greg
  • It's possible the called program adjusts its output depending on what stdout is. How about opening a regular file and seeing what bytes are actually written? BTW, `SpooledTemporaryFile` is overkill. The "spooled" part only works for data written from Python. When you got the file descriptor, it changed to a regular temporary file. The extra StringIO buffer wasn't used. – tdelaney May 16 '16 at 02:29
  • I wrote a quick Python program that spits out the UTF-8 string and your program worked for me. – tdelaney May 16 '16 at 02:29
  • Try running the command in a shell and redirecting to a file. If you have `vim` installed you should also have `xxd`, which can display a hex dump of a file. In your example text, the UTF-8 output should look like:
    ```
    0000000: 736f 6d65 206f 7574 7075 7420 7465 7874  some output text
    0000010: 20e2 809c 7175 6f74 65e2 809d 202e 2e2e   ...quote... ...
    ```
    The left quote is `e2 80 9c` and the right quote is `e2 80 9d`. – Lex Scarisbrick May 16 '16 at 02:54
  • Another way to think of it is that since the python 2.7 file read an ascii `?`, it was in the file being read. So, the program didn't write the string you think it did. – tdelaney May 16 '16 at 03:00
  • @tdelaney, did you try it on OS X? It actually works correctly for me on OS X. I'll update my question. It prints what I wrote above to the console when I run the command line directly. I can try having it redirect to a file but I don't know what that would show. – Greg May 16 '16 at 03:06
  • I tried redirecting to a file and it still outputs what I want. I don't have xxd, but this finds the character in question: `grep “ /tmp/test.txt` – Greg May 16 '16 at 03:13
  • @tdelaney, I think you're right, it's probably not happening in the read step, it's probably when check_call is capturing stdout and writing it to that file. – Greg May 16 '16 at 03:14
  • Running the code `f_output = SpooledTemporaryFile();subprocess.check_call('echo -e \'some output text \\xe2\\x80\x9cquote\\xe2\\x80\\x9d ...\'', shell=True, stdout=f_output);f_output.seek(0);output=f_output.read();print(repr(output));` in python2 in linux gave me `'some output text \xe2\x80\x9cquote\xe2\x80\x9d ...\n'`. – v7d8dpo4 May 16 '16 at 04:02

2 Answers

1

Wow, this was the weirdest issue ever but I've fixed it!

It turns out that the program it was calling (a Java program) was producing output in a different encoding depending on where it was called from!

On the dev OS X machine it returns the characters fine. On the Linux server, run from the command line, it returns them fine. Called from a Django app on that same server, the characters turn into "?"s.

To fix this I ended up adding this argument to the command:

-Dfile.encoding=utf-8

I got that idea here, and it seems to work. There's also a way to modify the Java program internally to do that.
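A minimal sketch of how that flag might be passed from Python, assuming the program is launched as a jar. The names `tool.jar`, `--input`, `data.txt`, and `build_java_cmd` are placeholders for illustration, not taken from the actual setup. The one detail that matters is ordering: JVM options such as -Dfile.encoding go before -jar, because everything after the jar path is handed to the Java program as its own arguments.

```python
import subprocess


def build_java_cmd(jar_path, args, encoding="utf-8"):
    """Build an argument list with the JVM option placed before -jar.

    jar_path and args are placeholders for whatever is actually being run.
    """
    # JVM options such as -Dfile.encoding must precede -jar; anything after
    # the jar path is passed to the Java program itself.
    return ["java", "-Dfile.encoding=" + encoding, "-jar", jar_path] + list(args)


cmd = build_java_cmd("tool.jar", ["--input", "data.txt"])
print(cmd)

# Uncomment on a machine where Java and the (hypothetical) jar exist:
# raw_output = subprocess.check_output(cmd)
```

Passing a list instead of a shell string also avoids the need for shell=True.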

Sorry I blamed Python! You guys had the right idea.

Greg
  • Have you tried to fix your locale settings (`locale.getpreferredencoding()`) as I've suggested in my answer (check them in the same context as the code that you want to run)? – jfs May 18 '16 at 20:01
0

The redirection (stdout=file) happens at the file descriptor level, so Python has nothing to do with what is written to the file. If you see ? instead of “ in the file itself (not in a REPL), then the child process wrote it that way.
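One way to convince yourself of this is a small round trip, sketched here with a stand-in child process (the current Python interpreter) instead of the real command. The child writes the UTF-8 bytes of the two curly quotes directly to file descriptor 1, and the parent reads back exactly those bytes from the file it passed as stdout.

```python
import subprocess
import sys
import tempfile

# Child program: write the UTF-8 bytes for the two curly quotes straight to
# file descriptor 1 (stdout), bypassing any text-encoding layer.
child_code = r"import os; os.write(1, b'\xe2\x80\x9cquote\xe2\x80\x9d')"

with tempfile.TemporaryFile() as f_output:
    # The child inherits the file descriptor; the parent never touches the data.
    subprocess.check_call([sys.executable, "-c", child_code], stdout=f_output)
    f_output.seek(0)
    data = f_output.read()

print(repr(data))
```

If the bytes come back intact here but not with the real command, the difference lies in what that command writes, not in the redirection.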

If it works on OS X and "doesn't work" on the Linux server, then the likely reason is a difference in the environment. Check the LC_ALL, LC_CTYPE, and LANG envvars: python, /bin/sh (due to shell=True), and the cmd may all use your locale encoding, which is ASCII if the environment is not set (C, POSIX locale).
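A rough sketch of that check, meant to be run in the same context as the failing code (for example inside the web app process, not just an interactive shell). The helper name `effective_ctype_locale` is made up for this example. It follows the POSIX precedence for the character-type category: LC_ALL wins over LC_CTYPE, which wins over LANG. The `en_US.UTF-8` value in the workaround is an assumption; that locale may not be installed on a given server.

```python
import locale
import os


def effective_ctype_locale(env):
    """Return the locale name that would decide the character encoding.

    For the character-type category, POSIX gives LC_ALL priority over
    LC_CTYPE, which in turn has priority over LANG. Empty values are ignored.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # nothing set: the C/POSIX locale applies


print("effective locale: %s" % effective_ctype_locale(os.environ))
print("preferred encoding: %s" % locale.getpreferredencoding())

# One possible workaround (assumes the en_US.UTF-8 locale is installed):
# give the child process an explicit locale instead of inheriting an empty one.
child_env = dict(os.environ, LC_ALL="en_US.UTF-8")
# subprocess.check_output(cmd, env=child_env)
```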

To get "raw binary" from a subprocess:

#!/usr/bin/env python
import subprocess

raw_binary = subprocess.check_output(['cmd', 'arg 1', 'arg 2'])
print(repr(raw_binary))

Note:

  • no shell=True—don't use it unless it is necessary
  • many programs may change their behavior if they detect that the output is not a tty, example.
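The tty point can be seen with a small sketch that uses the current Python interpreter as a stand-in child process. Because check_output connects the child's stdout to a pipe, the child reports that it is not attached to a terminal, even when the parent script is run interactively.

```python
import subprocess
import sys

# The child reports whether its stdout is a terminal. check_output gives it
# a pipe, so the answer is False regardless of how this script is started.
child_code = "import sys; print(sys.stdout.isatty())"
raw = subprocess.check_output([sys.executable, "-c", child_code])
print(repr(raw))
```

A program that switches encoding, colors, or buffering based on this check will behave differently under subprocess than it does when typed at a prompt.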
jfs