Problems with subprocess, encoding and logging with sqlite

Question

I have searched for quite a while for an answer to this question, and I think much of my trouble comes from my unfamiliarity with how the subprocess module works. This is for a fuzzing program, if anyone is interested. I should also mention that this is all being done on Linux (I think that is pertinent). I have some code like this:

# run the process, then collect its return code and stderr output
process = subprocess.Popen([app, file_name], stdout=subprocess.PIPE,
                                             stderr=subprocess.PIPE)
return_code = process.wait()
error_msg = process.communicate()[1]

# insert results into an sqlite database log
log_cur.execute('''INSERT INTO log (return_code, error_msg) 
                   VALUES (?,?)''', [unicode(return_code), unicode(error_msg)])
log_db.commit()

99 out of 100 times it works just fine, but occasionally I get an error similar to:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xce in position 43: invalid continuation byte

EDIT: Full-trace

Traceback (most recent call last):
  File "openscadfuzzer.py", line 72, in <module>
    VALUES (?,?)''', [crashed, err_msg.decode('utf-8')])
  File "/home/username/workspace/GeneralPythonEnv/openscadfuzzer/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdb in position 881: invalid continuation byte

Is this a problem with subprocess, the application that I am using it to run or my code? Any pointers would be appreciated (especially when it pertains to the correct usage of subprocess stdout and stderr).

Not a solution, just an observation: `wait()` and `communicate()` both wait for the process to finish. You can drop the `wait()` call and do something like `(out, err) = process.communicate(); return_code = process.returncode` — RedBaron, Apr 22 '13 at 08:00
Any chance you could give the stack trace that shows where the UnicodeDecodeError happens? — monk, Apr 22 '13 at 13:01
It's also probably worth pointing out the Python version used, as the automatic decode/encode semantics differ between 2 and 3 (and the subprocess libraries might differ too) — monk, Apr 22 '13 at 13:02
Sorry, it is in the tags but I guess I should mention that I am using Python 2.7 — Daniel Kuntz, Apr 22 '13 at 14:32
http://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte — RedBaron, Apr 22 '13 at 14:55
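
The suggestion in the first comment above can be sketched as follows. This is a minimal illustration rather than the asker's actual code: the helper name `run_and_capture` is hypothetical, the child command is a stand-in Python one-liner instead of the real application, and the snippet is written to run on Python 3. The subprocess documentation notes that calling `wait()` while using `stdout=PIPE` or `stderr=PIPE` can deadlock if the child fills a pipe buffer, which is another reason to rely on `communicate()` alone.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run argv and return (return_code, stdout_bytes, stderr_bytes).

    Hypothetical helper, named here for illustration only.
    """
    process = subprocess.Popen(argv, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    # communicate() reads both pipes until EOF and then waits for the
    # child to exit, so no separate wait() call is needed.
    out, err = process.communicate()
    return process.returncode, out, err


# Stand-in child (an assumption): writes one line to stderr, exits with 3.
child = [sys.executable, "-c",
         "import sys; sys.stderr.write('boom\\n'); sys.exit(3)"]
return_code, out, err = run_and_capture(child)
```

Both `out` and `err` come back as raw bytes here, which is what makes the later decoding step necessary.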

score 2 · Accepted Answer · answered Apr 22 '13 at 13:32

My guess is that the problem is this call:

unicode(error_msg)

What is the type of error_msg? I'm fairly sure that by default the subprocess APIs return the raw bytes output by the child program. The call to unicode then tries to convert those bytes into characters (code points) by assuming some encoding (in this case utf8).

My guess is that the bytes aren't valid utf8, but are valid latin1. You can specify what codec to convert between bytes and characters:

error_msg.decode('latin1')

Here's an example that hopefully demonstrates the problem and workaround:

>>> b'h\xcello'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 1: invalid continuation byte

>>> b'h\xcello'.decode('latin1')
'hÎllo'

A better solution might be to make your child process output utf8, but that also depends on what data your database is capable of storing.
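
If the child's output encoding cannot be controlled, another option is to decode leniently before logging. The following is a sketch under assumptions, not a definitive fix: the helper name `to_text` is hypothetical, the table layout is guessed from the column names in the question, an in-memory database stands in for the real log, and the code is written to run on Python 3. It decodes as UTF-8 with the `replace` error handler, so invalid bytes become U+FFFD instead of raising `UnicodeDecodeError`.

```python
import sqlite3


def to_text(raw, encoding="utf-8"):
    """Decode bytes for logging; undecodable bytes become U+FFFD.

    Hypothetical helper, named here for illustration only.
    """
    return raw.decode(encoding, "replace")


# In-memory stand-in for the log database (schema is an assumption
# based on the column names used in the question).
log_db = sqlite3.connect(":memory:")
log_cur = log_db.cursor()
log_cur.execute("CREATE TABLE log (return_code INTEGER, error_msg TEXT)")

raw_stderr = b"h\xcello"  # the same invalid UTF-8 bytes as in the example
log_cur.execute("INSERT INTO log (return_code, error_msg) VALUES (?, ?)",
                (1, to_text(raw_stderr)))
log_db.commit()

stored = log_cur.execute("SELECT error_msg FROM log").fetchone()[0]
```

This choice is lossy: the original bytes cannot be recovered from the stored text. For a fuzzer that may matter, in which case keeping the raw bytes alongside the decoded text is worth considering.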

I have had trouble with latin1 encoding in the past on a Windows computer. Is it used frequently in a Linux environment? This is a popular open-source program I am fuzzing and I don't think it would have been ported from Windows, but I could be wrong. I will try your suggestion. — Daniel Kuntz, Apr 22 '13 at 14:51
$LANG: en_US.UTF-8. I am pretty sure I figured it out. I am not checking at any point that my fuzzing input is valid utf-8, even though I am modifying human-readable text. openscad is just sticking any non-utf-8 bytes that the fuzzing creates into the error stream when it thinks it is reporting invalid code. So, I guess this counts as a bug in the code? — Daniel Kuntz, Apr 23 '13 at 17:17

score 1 · Answer 2 · answered Apr 22 '13 at 07:33

You can find a very good subprocess tutorial at http://pymotw.com/2/subprocess/ and the official documentation at http://docs.python.org/2/library/subprocess.html. Judging by how the error you're getting is formatted, though, it seems that it is not your code but your application that hits the error, and you only see it because you're collecting the output. To confirm that, run your app outside your code, using a simple bash loop, and see whether you can catch the error again. In your code, also check the exit code of the application: when you see the error it should be different from 0, provided the application sets its exit codes correctly.
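
The exit-code check described above might look roughly like this. It is only a sketch: `describe_exit` is a hypothetical helper, the child command is a stand-in Python one-liner rather than the application being fuzzed, and the code is written to run on Python 3. It relies on the documented behaviour that, on POSIX, `Popen.returncode` is `-N` when the child was terminated by signal `N`.

```python
import subprocess
import sys


def describe_exit(return_code):
    """Turn a Popen.returncode value into a short label.

    Hypothetical helper, named here for illustration only.
    """
    if return_code == 0:
        return "ok"
    if return_code < 0:
        # On POSIX, a negative value -N means the child was killed by
        # signal N (for example -11 for SIGSEGV on Linux).
        return "killed by signal %d" % -return_code
    return "exited with status %d" % return_code


# Stand-in child (an assumption): exits with status 2, prints nothing.
child = [sys.executable, "-c", "import sys; sys.exit(2)"]
process = subprocess.Popen(child, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
process.communicate()
label = describe_exit(process.returncode)
```

For a fuzzer, the signal case is usually the interesting one, since it separates a genuine crash from an application that merely rejected its input.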

Thanks, I will try that later today and report back on my results. This is what I suspected was happening. I guess I will just catch the error and make up my own to put in the database when it happens like: "ERROR: program crashed with indecipherable error message" haha. You'll probably get the answer but I want to check first. — Daniel Kuntz, Apr 22 '13 at 14:44
