$ ./a.py b.xml

This works: a.py reads the files and prints something.

a.py reads its arguments like this:

# Each argument is a file
args = sys.argv[1:]

# Loop on files
for filename in args :

    # Open the file
    file = open(filename)

I want to pipe the output to other scripts.

$ ./a.py b.xml | grep '1)'

This gives a Python error.


This also fails:

$ x=$(./a.py b.xml); echo $x...

How can I tell Python not to interpret shell syntax such as |, $() and ``?


The error is

Traceback (most recent call last):
  File "./flattenXml.py", line 135, in <module>
    process(file, prefix)
  File "./flattenXml.py", line 116, in process
    linearize(root, prefix + "//" + removeNS(root.tag))
  File "./flattenXml.py", line 104, in linearize
    linearize(childEl, path + '/' + numberedTag)
  File "./flattenXml.py", line 104, in linearize
    linearize(childEl, path + '/' + numberedTag)
  File "./flattenXml.py", line 104, in linearize
    linearize(childEl, path + '/' + numberedTag)
  File "./flattenXml.py", line 104, in linearize
    linearize(childEl, path + '/' + numberedTag)
  File "./flattenXml.py", line 104, in linearize
    linearize(childEl, path + '/' + numberedTag)
  File "./flattenXml.py", line 104, in linearize
    linearize(childEl, path + '/' + numberedTag)
  File "./flattenXml.py", line 83, in linearize
    print path + "/@" + removeNS(name) + "=" + val
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 106: ordinal not in range(128)

The python script is from Python recipes.

Ksthawma
  • Please post the *exact* error you see. – Stefano Sanfilippo Oct 03 '13 at 19:53
  • Python never sees the shell syntax, the shell processes it transparently to the program. What error are you getting? – Barmar Oct 03 '13 at 19:53
  • We cannot fix an approximation by guesses, we need the **exact** error message. The script itself seems fine to me. – Stefano Sanfilippo Oct 03 '13 at 19:57
  • You're still not saying what "fail" means. The above works, except that you're not using "$x" in the echo. – jhermann Oct 03 '13 at 19:57
  • *Please*, copy the error you see in the console and paste it here. What does "fails" mean, exactly? We need the error message that gets printed in your console, if any. – Stefano Sanfilippo Oct 03 '13 at 19:58
  • Copied. Sorry, I'm new to Stack Overflow and was figuring out whether to put the actual error in a comment or edit the original post. – Ksthawma Oct 03 '13 at 20:00
  • The shell is ok, we need to inspect `flattenXml.py`. At least the `linearize` function, around line 83 of that file. – Stefano Sanfilippo Oct 03 '13 at 20:00
  • Thanks Stefano Sanfilippo. I added the link to the actual Python script. The line numbers are not the same, though, because I added comments after saving it from the recipe. – Ksthawma Oct 03 '13 at 20:04
  • When you run this on a terminal, the output is utf-8, but when you pipe to another program, python assumes you want ascii, hence the encode error. I kinda answered this in [Redirecting python's stdout to the file fails with UnicodeEncodeError](http://stackoverflow.com/questions/19145183/redirecting-pythons-stdout-to-the-file-fails-with-unicodeencodeerror) but I'd like to know a better answer! – tdelaney Oct 03 '13 at 20:06
  • I don't understand - is the pipe interpreted in Python or in the shell? Can I force Python to end within [./a.py b.xml] and hand off? – Ksthawma Oct 03 '13 at 20:08
  • @KaFaiLo - The write end of the pipe is python's sys.stdout and python assumes it's ascii. I have a short script in the post that demonstrates the problem. Pipe it to 'cat' instead of redirecting to a file and you should see an encoding error. `$ ./testscript.py | cat` should do it. – tdelaney Oct 03 '13 at 20:11
  • @KaFaiLo - I believe that grep handles utf-8 (not sure!), hopefully flattenXml.py can set a utf-8 output encoding and then everything should magically work. – tdelaney Oct 03 '13 at 20:15
  • @tdelaney - Thank you so much. I found that the cause is in the XML file, which has UTF-8 characters. The method .encode(encoding) you mentioned in the other post works.
    And | cat works and it is so easy, requiring no .encode or any changes in the python script.
    – Ksthawma Oct 03 '13 at 20:27
  • @tdelaney - Just want to know more about the mechanism. "The write end of the pipe is python's sys.stdout", so there is no way to end python before the end of the command? Am I right to think that Python takes the rest of the command line and runs all within it? – Ksthawma Oct 03 '13 at 20:34
  • Sorry, my fault. | cat is not a fix. .encode(encoding) is tested to be the fix. – Ksthawma Oct 03 '13 at 20:40
  • @KaFaiLo - in your case, the shell glues python's stdout to grep's stdin and lets 'em rip. Python gets its command line (everything up to the | symbol - the shell splits the command there and gives the rest to grep) and runs full bore. Python writes to stdout, blocks any time the pipe gets full, and finally exits. grep keeps reading stdin until it gets EOF and exits. – tdelaney Oct 03 '13 at 20:40

1 Answer


The problem is that your document contains non-ASCII characters, which can't be printed to an ASCII output stream.

Internally, Python can handle any Unicode character, but when that character is serialized, Python needs to know which representation to use (UTF-8, UTF-16, or any of a zillion international character encodings) so that it can write the correct bytes.

When run in a console, Python can get the terminal's encoding (mine happens to be en_US.UTF-8) and set up an encoder for sys.stdout properly. When stdout is piped to another program or redirected to a file, Python 2 has no encoding to go on (sys.stdout.encoding is None) and falls back to the ascii encoder for sys.stdout.

So in a console, the encoder usually knows how to convert the character to the right bytes for your terminal and you get a nice display. When piped, the ascii encoder can't handle the character (here u'\xa0', a no-break space) and throws an error.
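
To see the failure in isolation, here is a minimal sketch that needs no pipe at all. It encodes a string containing u'\xa0' (the character from the traceback) with both codecs. The sample string and the helper name try_encode are made up for illustration; they are not from flattenXml.py.

```python
# Sample text containing U+00A0 (no-break space), the character named in
# the traceback. The string itself is an invented example.
text = u"price:\xa0100"


def try_encode(s, encoding):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None


# The ascii codec only covers code points 0-127, so this fails.
ascii_result = try_encode(text, "ascii")

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_result = try_encode(text, "utf-8")
```

The ascii attempt is the same step that blows up inside print when stdout is a pipe; the UTF-8 attempt is what the fix relies on.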

One solution is to encode everything to utf-8 before writing to stdout.

import sys
encoding = sys.stdout.encoding or 'utf-8'

...
print (path + "/@" + removeNS(name) + "=" + val).encode(encoding)

Here, the UTF-8 encoder produces a byte string that passes unchanged through the still-existing ascii encoder on sys.stdout and makes it to the other side. It's an open question whether the program on the other side can handle UTF-8.
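
If you'd rather not touch every print, another option is to wrap the output stream once with codecs.getwriter. This is a rough sketch rather than a drop-in patch: it uses an in-memory io.BytesIO as a stand-in for the pipe, and the sample line and the helper name make_utf8_writer are hypothetical.

```python
import codecs
import io


def make_utf8_writer(byte_stream):
    """Wrap a byte stream so that unicode text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)


# Stand-in for the write end of the pipe (assumption: a real script would
# wrap its actual stdout instead of an in-memory buffer).
fake_pipe = io.BytesIO()
out = make_utf8_writer(fake_pipe)

# Invented sample line in the style of the flattened XML output.
out.write(u"//root/item[1]/@label=a\xa0b\n")

# The bytes that would travel down the pipe.
piped_bytes = fake_pipe.getvalue()
```

In a Python 2 script the same wrapper would go around sys.stdout itself, e.g. sys.stdout = codecs.getwriter('utf-8')(sys.stdout), placed once near the top. Setting the PYTHONIOENCODING environment variable to utf-8 before running the script should have a similar effect with no code change.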

tdelaney