I'm trying to list the contents of a zip archive with Python 2.7 on 64-bit Windows Vista. I tried making a system call to 7-Zip (my favourite archive manager) using the subprocess module:

# -*- coding: utf-8 -*-
import sys, os, subprocess

Extractor = r'C:\Program Files\7-Zip\7z.exe'
ArchiveName = r'C:\temp\bla.zip'
output = subprocess.Popen([Extractor,'l','-slt',ArchiveName],stdout=subprocess.PIPE).stdout.read()

This works fine as long as the archive contains only ASCII filenames, but when I try it with non-ASCII names I get an output string in which ä, ë, ö, ü have been replaced by \x84, \x89, \x94, \x81 (and so on). I've tried all kinds of decode/encode calls, but I'm just too inexperienced with Python (and generally too stupid) to reproduce the original characters with umlauts (which I need if I want to follow this step with, e.g., an extraction subprocess call to 7z).

Simply put, my question is: how do I get this to work for archives with non-ASCII content as well?

... or to put it in a more convoluted way: Is the output of subprocess always of a fixed encoding or not? In the former case -> Which encoding is it? In the latter case -> How can I control or uncover the encoding of the output of subprocess? Inspired by similar questions on this blog I've tried adding

import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

and I've also tried

my_env = os.environ
my_env['PYTHONIOENCODING'] = 'utf-8'
output = subprocess.Popen([Extractor,'l','-slt',ArchiveName],stdout=subprocess.PIPE,env=my_env).stdout.read()

but neither seems to alter the encoding of the output variable (or to reproduce the umlaut).

jfs
commuter
  • reading subprocess pipes on Python 2 returns a bytestring -- a sequence of bytes -- no encoding. So the question is what character encoding does 7z use to print the output and whether the output is corrupted somehow between 7z and the parent python process (on Linux each byte is passed as is but [on Windows there could be some implicit conversion](http://stackoverflow.com/q/28943943/4279)). What is clear is that changing the encoding of `sys.stdout` or `PYTHONIOENCODING` has no effect here. – jfs Mar 10 '15 at 15:46
  • unrelated: use `subprocess.check_output()` to get the output from a command. – jfs Mar 10 '15 at 15:48
  • J.F. Sebastian, thanks for the reaction! Actually my issue persists beyond 7z. To test I created the file ‘blä.txt’. 'dir' from the command line prints 'blä.txt', but with `subprocess.Popen(['dir',folder], … )` or `check_output` in python2.7 on 64bit Windows 7 & Vista I always get ‘bl\x84.txt’. The encoding of `sys.stdout` or `PYTHONIOENCODING` does not influence my output in any way … What am I doing wrong? Is this caused by the implicit windows conversion you mentioned (and how could I test that)? – commuter Mar 12 '15 at 09:19
  • … In fact `sys.stdout = codecs.getwriter('utf8')(sys.stdout)` does not seem to take effect. Even after that command `print sys.stdout.encoding` returns `None`. Also my `sys.stdout` does not have an `isatty` method … Could the `subprocess` problem be caused by an underlying issue with my `sys` module? – commuter Mar 12 '15 at 09:21
  • Reassigning `sys.stdout` or setting `PYTHONIOENCODING` has no effect (as it should) on the character encoding used by the `7z` and `dir` commands. Anyway, you should not touch `sys.stdout` or `PYTHONIOENCODING` *inside* your Python script (configure your environment (outside) and print Unicode directly instead). 7z and dir are separate issues. There could be several character encodings here: 1. the character encoding that might be used inside the zip archive (if any) 2. the console OEM code page 3. the mbcs (Windows) encoding. Also, a program may use the Unicode API to print to the Windows console and use an arbitrary encoding if its output is redirected – jfs Mar 12 '15 at 15:25
  • to improve your understanding, you could ask about `dir` of a directory with non-ascii names in a separate question. To find out the encoding, you could get the filenames as Unicode: `filenames = os.listdir(u'.')` and try the likely candidates: `filenames[0].encode('mbcs')` or `.encode(whatever_chcp_command_returns)` and compare the bytes with the result of `check_output('dir', shell=True)`. Make sure you understand the difference between: `print(s)` and `print(repr(s))`. – jfs Mar 12 '15 at 15:28
  • Wow, this is way, way more complicated than I expected … Thanks for your patient explanations, JF Sebastian! (BTW, adding a `-sccUTF-8` flag to 7z has helped quite a bit with that application; ‘dir’ I still do not fully understand, but for now I can live with my own ignorance on that part :O ) – commuter Mar 13 '15 at 09:19
  • you could compare 7z results with the corresponding zipfile module results. zipfile decodes using [either cp437 or utf8](http://stackoverflow.com/q/13261347/4279) – jfs Mar 13 '15 at 10:34
  • zipfile decodes on Python 3. It returns bytes as is on Python 2. – jfs Mar 13 '15 at 10:42
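
The comparison suggested in the comments (try the likely encodings and see which one reproduces the known filename) can be sketched as follows. This is only an illustration: the helper name `guess_encodings` and the default candidate list are assumptions, not part of the thread, and on a real system the OEM code page is whatever `chcp` reports. Windows-only codecs such as `mbcs` can be passed in as extra candidates; unknown codec names are skipped.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def guess_encodings(raw, expected,
                    candidates=('cp437', 'cp850', 'cp1252', 'utf-8')):
    """Return the candidate encodings under which `raw` decodes to `expected`.

    `raw` is the bytestring read from the subprocess pipe and `expected`
    is the filename as Unicode text (e.g. taken from os.listdir(u'.')).
    The candidate list is a guess at likely encodings, not a complete set.
    """
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except (UnicodeDecodeError, LookupError):
            # Bytes are invalid in this encoding, or the codec does not
            # exist on this platform (e.g. 'mbcs' outside Windows).
            pass
    return matches


raw = b'bl\x84.txt'               # the bytes reported in the question
print(repr(raw))                  # repr shows the bytes, not a rendering
print(guess_encodings(raw, 'blä.txt'))
```

If an OEM code page such as cp437 or cp850 is among the matches, decoding the pipe output with that encoding should give back the umlauts, which is consistent with the "console OEM code page" possibility raised above.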

1 Answer

You can try 7-Zip's `-sccUTF-8` switch, which forces the console output to UTF-8. Here is the reference page: http://en.helpdoc-online.com/7-zip_9.20/source/cmdline/switches/scc.htm
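
A minimal sketch of how that switch might be combined with the subprocess call from the question, assuming 7z then writes UTF-8 to the pipe and that the `-slt` listing contains `Path = ...` lines. The function names `build_list_command`, `list_archive` and `entry_paths` are hypothetical helpers introduced here for illustration.

```python
# -*- coding: utf-8 -*-
import subprocess


def build_list_command(extractor, archive_name):
    """Build the 7z 'list' command line with UTF-8 console output forced."""
    return [extractor, 'l', '-slt', '-sccUTF-8', archive_name]


def list_archive(extractor, archive_name):
    """Run 7z and return its listing as Unicode text.

    Assumes 7z honours -sccUTF-8 when its stdout is a pipe, so the raw
    bytes can be decoded as UTF-8.
    """
    raw = subprocess.check_output(build_list_command(extractor, archive_name))
    return raw.decode('utf-8')


def entry_paths(listing):
    """Collect the values of all 'Path = ...' lines in a -slt listing.

    This also picks up the archive's own path if 7z prints one.
    """
    prefix = 'Path = '
    return [line[len(prefix):]
            for line in listing.splitlines()
            if line.startswith(prefix)]
```

With the paths from the question, a call would look like `entry_paths(list_archive(r'C:\Program Files\7-Zip\7z.exe', r'C:\temp\bla.zip'))`. Printing the result with `repr()` rather than directly keeps console rendering out of the picture while checking that the umlauts survived.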

Bipin