subprocess call failing with unicode data from web form input - works with same input from elsewhere

Question

I have a library with a function that checks various input data against a number of regexps to ensure that the data is valid. The function is called both for input received by a CGI script from a web form (via lighttpd) and for input read from an SQLite database (which is populated, in turn, from SMS messages received by gammu-smsd).

The input is at times in English and at times in Hindi, i.e. in Devnagari script. It should always be encoded as UTF-8. I have struggled with Python's re and regex modules, which seem to be buggy when it comes to correctly matching character classes against Devnagari characters (you can see one example here - in that case using regex instead of re fixed the problem, but I've since had trouble with regex too). Command-line grep appears far more reliable and accurate. Hence, I've resorted to using a subprocess call to pipe the requisite strings to grep, as follows:

def invalidfield(datarecord, msgtype):
    for fieldname in datarecord:
        if (msgtype, fieldname) in mainconf["MSG_FORMAT"]:
            try:
                check = subprocess.check_output("echo '" + datarecord[fieldname] + "' | grep -E '" + mainconf["MSG_FORMAT"][msgtype, fieldname] + "'", shell=True)
            except subprocess.CalledProcessError:
                return fieldname
    return None

Now, let's try this out with the following string as input: न्याज अहमद् and the following regex to check it : ^[[:alnum:] .]*[[:alnum:]][[:alnum:] .]*$

Oddly enough, exactly the same input, when read from the database, passes this regexp (as it should) and the function returns correctly. But when the same string is entered via the web form, subprocess.check_output fails with this error:

File "/usr/lib/python2.7/subprocess.py", line 537, in check_output
  process = Popen(stdout=PIPE, *popenargs, **kwargs)
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
  errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1259, in _execute_child
  raise child_exception
TypeError: execv() arg 2 must contain only strings

I cannot figure out what is going on. I've modified my lighttpd.conf using this script which ought to, at least, ensure that lighttpd.conf is using the utf-8 charset. I've also used the chardet module and run chardet.detect on the input from the webform. I get this: {'confidence': 1.0, 'encoding': 'ascii'}{'confidence': 0.99, 'encoding': 'utf-8'}

In accordance with this answer I tried replacing datarecord[fieldname] in the above with unicode(datarecord[fieldname]).encode('utf8') and also with first trying to decode datarecord[fieldname] with the 'ascii' codec. The latter fails with the usual 'ordinal not in range' error.

What is going wrong here? I just can't figure it out!

And it would be `datarecord[fieldname].encode('utf8')` (no need to turn unicode to unicode). — Martijn Pieters, Mar 24 '13 at 11:46
As it is a CGI script I had to ask it to write out `str(repr(datarecord[fieldname]))` to a file. I get this: '\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\x9c \xe0\xa4\x85\xe0\xa4\xb9\xe0\xa4\xae\xe0\xa4\xa6\xe0\xa5\x8d' — ShankarG, Mar 24 '13 at 11:50
And using datarecord[fieldname].encode('utf8') also doesn't work... — ShankarG, Mar 24 '13 at 11:51
That is already-encoded UTF8 data for `न्याज अहमद्`; there should be no need to encode that. Calling `unicode()` on it will attempt to *decode* that to Unicode but as ASCII instead. — Martijn Pieters, Mar 24 '13 at 11:52
But why are you using `grep` to do a task you can do in Python instead? — Martijn Pieters, Mar 24 '13 at 11:54
If you *have* to use `grep -E`, why not write your data to the `stdin` pipe for the `grep` subprocess instead of using `echo` and `shell=True`. — Martijn Pieters, Mar 24 '13 at 12:02
Thanks for your prompt comments. As noted in the question, I'm using `grep` because `regex` seems buggy with Devnagari. For instance, regex fails when matching `सददीक अहमद्` against the same regexp listed above, when it ought to match. — ShankarG, Mar 24 '13 at 12:50
Try using the [`regex`](https://pypi.python.org/pypi/regex) library instead, it has *much* better Unicode support. — Martijn Pieters, Mar 24 '13 at 12:51
No need for apologies - your suggestion to use the `stdin` pipe seems to have worked! I still don't understand why the earlier one doesn't work, but this does. Can you put it in as an answer so I can accept it? — ShankarG, Mar 24 '13 at 15:11
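
To make the point about the posted repr concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the question uses Python 2.7), where bytes and text are distinct types, and the variable names are only illustrative. It shows that the bytes from the CGI script are already valid UTF-8, and that decoding them as ASCII fails with the "ordinal not in range" error mentioned in the question.

```python
# Bytes exactly as they appear in the repr written out by the CGI script.
raw = (b'\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\x9c '
       b'\xe0\xa4\x85\xe0\xa4\xb9\xe0\xa4\xae\xe0\xa4\xa6\xe0\xa5\x8d')

# Decoding as UTF-8 succeeds: the data is already encoded text.
text = raw.decode('utf-8')

# Decoding as ASCII fails, because every byte >= 0x80 is outside ASCII.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Encoding the decoded text again gives back the original bytes.
assert text.encode('utf-8') == raw

print(len(raw), "bytes decode to", len(text), "code points")
print("ASCII decode error:", ascii_error)
```

In other words, the value only needs decoding when text is required and should be passed along unchanged when bytes are required; encoding it a second time is never the right step.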

Martijn Pieters · Accepted Answer · 2013-03-25T15:39:21.943

3

You want to avoid using echo in this case; write your input directly to the stdin pipe of the Popen() object instead.

Do make sure your environment is set to the correct locale so that grep knows to parse the input as UTF-8:

env = dict(os.environ)
env['LC_ALL'] = 'en_US.UTF-8'
p = subprocess.Popen(['grep', '-E', mainconf["MSG_FORMAT"][msgtype,fieldname]], stdin=subprocess.PIPE, env=env)
p.communicate(datarecord[fieldname])
if p.returncode:
    return fieldname
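
For readers on Python 3, the same idea can be sketched as a self-contained helper. This is an illustration under stated assumptions, not the exact code of this answer: the function name `grep_matches` and its `locale_name` parameter are hypothetical, `grep` is assumed to be on the PATH, and the `en_US.UTF-8` locale is assumed to be installed (without it, grep may fall back to treating the input as plain bytes).

```python
import os
import subprocess


def grep_matches(pattern, text, locale_name='en_US.UTF-8'):
    """Return True if `grep -E` matches at least one line of `text`.

    `grep_matches` and `locale_name` are hypothetical names for this sketch.
    """
    env = dict(os.environ)       # copy, so the parent environment is untouched
    env['LC_ALL'] = locale_name  # tells grep how to interpret the input bytes
    result = subprocess.run(
        ['grep', '-E', '-e', pattern],  # argument list: no shell, no quoting
        input=text.encode('utf-8'),     # data goes straight to grep's stdin
        stdout=subprocess.DEVNULL,      # keep matched lines out of CGI output
        stderr=subprocess.DEVNULL,
        env=env,
    )
    # grep exits with 0 on a match, 1 on no match and 2 on error;
    # this sketch treats errors as "no match".
    return result.returncode == 0
```

A call such as `grep_matches(mainconf["MSG_FORMAT"][msgtype, fieldname], datarecord[fieldname])` would then replace the `echo ... | grep` pipeline. Because no shell is involved, a quote character in the input cannot break the command.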

edited Mar 25 '13 at 15:39

answered Mar 24 '13 at 15:13


I'm sorry, I was wrong - this does not actually work. I had a bug in my code (I was still checking for CalledProcessError, which was not being raised by Popen). When I rewrote the code I discovered two things - first there were problems with the file handles, and after that grep keeps failing to match even when it matches from the command line. Others seem to have reported encoding problems with Popen.communicate (see e.g. [here](http://stackoverflow.com/questions/3810302/python-unicode-popen-or-popen-error-reading-unicode)), but no luck with that solution either for me. – ShankarG Mar 25 '13 at 14:16
How well do you understand unicode encodings and Python? Does the [Unicode HOWTO](http://docs.python.org/2/howto/unicode.html) help? You need to match the encoding to what `grep` expects, which is probably based on the `LC_` family of environment variables. – Martijn Pieters Mar 25 '13 at 14:20
I've read the docs but I confess that I'm still confused on unicode in python. I use `reload(sys); sys.setdefaultencoding('utf-8')` at the start of the program, along with `from __future__ import unicode_literals` in order to try to force a uniform encoding. I had numerous problems at the start but most of them appear to be fixed. This is the only one left. What is doubly odd is that it works fine with data from the sms database but not from the CGI form. And believe me, it is *very* confusing and irritating. :) – ShankarG Mar 25 '13 at 14:24
`setdefaultencoding()` should never be needed; that just indicates you are mixing bytes and unicode somewhere in ways that you should not do. – Martijn Pieters Mar 25 '13 at 14:25
Possibly. I used it to ensure that the output from the CGI script was appropriately encoded at all times (rather than manually enforcing utf8 on every output, since much of it is in Devnagari). I doubt it is the source of the present problem, though. – ShankarG Mar 25 '13 at 14:29
I should mention that if I simulate the operation by running Popen with `cat` and setting stdout to a file, and then running that file through grep, grep matches. Does this help with making sense of the problem? – ShankarG Mar 25 '13 at 14:34
@ShankarG: Presumably your output is written as UTF-8 then, but it could be that `grep` alters behaviour when grepping a pipe as opposed to a file. – Martijn Pieters Mar 25 '13 at 14:37
@ShankarG: Also see [Python - How do I pass a string into subprocess.Popen (using the stdin argument)?](http://stackoverflow.com/q/163542) for the proper way of providing `stdin` data to a subprocess such as grep. – Martijn Pieters Mar 25 '13 at 14:44
Yes, that's how I'm using it, at least as far as I understand (no space for a code snippet in comments :) ). Incidentally, as a test, in python I tried using popen.communicate to write out the output (via `cat`) to a file, and then running subprocess.check_output to run that file through grep. Grep once again fails to match. However, it matches at the command line. Does this mean there's something different about the environment that affects grep's behaviour? – ShankarG Mar 25 '13 at 14:49
@ShankarG: perhaps, yes. Check what the `LC_*` locale env variables are set to. – Martijn Pieters Mar 25 '13 at 14:55
Yes, I had this idea too just now. I added "LC_ALL=en_US.UTF-8" to the environment and it seems to be working. Still testing. – ShankarG Mar 25 '13 at 15:21
Seems to have worked. Fingers crossed though that I'm not making a mistake like last time. Phew, what a journey - so many weird problems. Marking answer as accepted again as this was part of the solution. Will add an additional answer below with this point once I am absolutely sure. – ShankarG Mar 25 '13 at 15:29
Just saw that - now hope there is not some other mystery at work :). Meanwhile, edited your answer to add in the final test (which is different in this case) so that anyone looking at this answer will get the full idea... OOPS - apparently that's hidden pending "peer review". – ShankarG Mar 25 '13 at 15:38
Added one more edit, dealing with empty strings (which is another bug that this code resulted in). – ShankarG Mar 25 '13 at 16:11
@ShankarG: My answer does not have to reflect *every detail* your solution needs. It only needs to illustrate how to solve your immediate problem (e.g. unicode problems and grep -E) – Martijn Pieters Mar 25 '13 at 16:12
True, but the solution that is suggested will directly result in this bug and in that sense will fail to reproduce the behaviour of the original function. As it's your answer, won't tamper with it further, am adding one separately below just so that others who have this problem don't have to stumble on it the hard way. – ShankarG Mar 26 '13 at 04:56
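
Since the fix in this thread turned out to hinge on `LC_ALL`, it can help to look at what the script actually inherits, because a CGI process may see a different environment from an interactive shell. The following is a small sketch in Python 3 (an assumption; the thread uses Python 2.7). The helper name `locale_settings` is hypothetical.

```python
import os

# Variables other than LC_* that also influence locale selection.
EXTRA_LOCALE_KEYS = ('LANG', 'LANGUAGE')


def locale_settings(environ=None):
    """Return the locale-related variables from an environment mapping.

    `locale_settings` is a hypothetical helper; it defaults to os.environ.
    """
    if environ is None:
        environ = os.environ
    return {
        key: value
        for key, value in environ.items()
        if key.startswith('LC_') or key in EXTRA_LOCALE_KEYS
    }


if __name__ == '__main__':
    # Print in a stable order so two runs are easy to compare.
    for key, value in sorted(locale_settings().items()):
        print(key, '=', value)
```

Running this once from a terminal and once from inside the CGI script (writing the result to a file there) shows whether the two environments differ.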

score 0 · Answer 2 · answered Mar 26 '13 at 05:09

Just to add to Martijn Pieters' answer: the solution he has suggested will fail in the case of an empty input string. With empty input there are no lines for grep to test, so grep reports no match even if the regexp would have permitted an empty string. The original function did not have this problem, because echo always sends at least a newline. Therefore a complete implementation of the original function would be:

import os
import subprocess
import regex

def invalidfield(datarecord, msgtype):
    for fieldname in datarecord:
        if (msgtype, fieldname) in mainconf["MSG_FORMAT"]:
            pattern = mainconf["MSG_FORMAT"][msgtype, fieldname]
            if not datarecord[fieldname]:
                # grep sees no lines for empty input, so check empty values with regex
                if not regex.search(pattern, datarecord[fieldname], regex.UNICODE):
                    return fieldname
            else:
                # Copy the environment so the parent process is not modified
                curenv = dict(os.environ)
                curenv['LC_ALL'] = "en_US.UTF-8"
                check = subprocess.Popen(['grep', '-E', pattern], stdin=subprocess.PIPE, env=curenv, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
                check.communicate(datarecord[fieldname])
                if check.returncode:
                    return fieldname
    return None

This works as regex matching works fine on an empty string.
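
An alternative that avoids the special case is sketched below, offered as an illustration rather than a tested replacement. The original `echo` always sent a trailing newline, so grep saw exactly one (possibly empty) line. Appending a newline to the data restores that behaviour. The sketch is written for Python 3 (an assumption; the thread uses Python 2.7), the name `field_matches` is hypothetical, and `grep` is assumed to be on the PATH.

```python
import os
import subprocess


def field_matches(pattern, value):
    """Return True if `value`, sent to grep as a single line, matches `pattern`.

    `field_matches` is a hypothetical name used only in this sketch.
    """
    env = dict(os.environ)
    env['LC_ALL'] = 'en_US.UTF-8'

    # The trailing newline turns even an empty value into one empty line,
    # which is what `echo ''` used to send.
    data = value.encode('utf-8') + b'\n'

    proc = subprocess.Popen(
        ['grep', '-E', '-e', pattern],
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        env=env,
    )
    proc.communicate(data)
    return proc.returncode == 0
```

One caveat: grep works line by line, so a value that itself contains newlines counts as matching when any one of its lines matches. If that matters, such values need to be rejected before calling grep.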
