(I've read various posts and resources, including "The Absolute Minimum <...>", but I still don't understand how to solve my issue.)

I want to feed every file in my directory (and its subdirectories) to the xmllint tool. Some file names contain Chinese characters.

#!/usr/bin/env python3

import os
import subprocess

fn_folder = "d:/test"

fn_tool_path = 'd:/libxml2-2.9.3-win32-x86_64/bin/xmllint.exe'

# Walk the tree and start xmllint on every file found.
for root, subFolders, files in os.walk(fn_folder):
    for eachfile in files:
        fullname = os.path.join(root, eachfile)
        # A list of arguments also keeps paths containing spaces intact.
        subprocess.Popen([fn_tool_path, '--format', fullname])

For example, if the d:\test folder contains two files, test1.xml and test2山.xml (a Chinese character after the '2'), the first one is processed correctly, while for the second one I get warning: failed to load external entity "file:/d:/test/test2%3F.xml", i.e. the "faulty" character was escaped before being passed as an argument. How can I avoid this?

Martijn Pieters
Vasily A
  • Sorry, I made a wrong assumption about what Python version you were running on. I've deleted my answer for now. – Martijn Pieters Apr 14 '16 at 07:05
  • It doesn't look like a Python issue. Python 3 uses the Unicode API to start a process on Windows -- there should be no issue passing a file name with Chinese characters. What happens if you run the same `xmllint` command in the console? Unrelated: 1- `subprocess.Popen()` does not wait for the command to finish -- you may create too many processes. Here's [how to limit the number of concurrent processes](http://stackoverflow.com/a/14533902/4279) 2- you could use raw string literals for Windows paths e.g., `r'd:\test'`. – jfs Apr 14 '16 at 07:56
  • As @J.F.Sebastian noted, Python 3 isn't the problem. It's that xmllint.exe isn't compiled with Unicode support, so it creates the `argv` vector for `main` by parsing the ANSI command line from `GetCommandLineA`. That's why there's a "?" (%3F) in the filename: most Windows codepages use a question mark as the replacement character. As a workaround, if 8.3 DOS names are enabled on the system, you could use ctypes to call `GetShortPathNameW`. Or you could make temporary copies of the files using random ASCII names. – Eryk Sun Apr 14 '16 at 09:10
  • @eryksun you could post your comment as an answer: it has the explanation (`GetCommandLineA()`) and a possible workaround (`GetShortPathNameW()`). – jfs Apr 14 '16 at 09:14
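
The two workarounds from Eryk Sun's comment could be sketched roughly as below. This is only an illustration under stated assumptions, not a tested fix for xmllint: the helper names (`is_ascii`, `short_path`, `ascii_temp_copy`, `xmllint_safe_path`) are made up for this example, `tmp_dir` is assumed to be an existing directory whose own path is pure ASCII, and the short-name route only helps if 8.3 names are enabled on the volume.

```python
import ctypes
import os
import shutil
import sys
import uuid


def is_ascii(path):
    """Return True if every character of the path is ASCII."""
    try:
        path.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def short_path(path):
    """Return the 8.3 short form of an existing path (Windows only).

    If short names are disabled on the volume, Windows may simply
    return the long name again, so the caller has to check the result.
    """
    if sys.platform != "win32":
        raise OSError("GetShortPathNameW is only available on Windows")
    from ctypes import wintypes
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_short = kernel32.GetShortPathNameW
    get_short.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    get_short.restype = wintypes.DWORD
    # First call: ask for the required buffer size (includes the NUL).
    size = get_short(path, None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    buf = ctypes.create_unicode_buffer(size)
    if get_short(path, buf, size) == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.value


def ascii_temp_copy(path, tmp_dir):
    """Copy the file into tmp_dir under a random ASCII name."""
    ext = os.path.splitext(path)[1]
    if not is_ascii(ext):
        ext = ""
    target = os.path.join(tmp_dir, uuid.uuid4().hex + ext)
    shutil.copyfile(path, target)
    return target


def xmllint_safe_path(path, tmp_dir):
    """Pick a path that an ANSI-only program should receive intact."""
    if is_ascii(path):
        return path
    if sys.platform == "win32":
        candidate = short_path(path)
        if is_ascii(candidate):
            return candidate
    # Fallback: work on a temporary copy with an ASCII-only name.
    return ascii_temp_copy(path, tmp_dir)
```

In the loop from the question, `fullname` would then be passed through `xmllint_safe_path(fullname, tmp_dir)` before building the command. Note that with the temporary-copy fallback xmllint reads the copy, so any file names in its messages refer to the copy, and the copies need to be cleaned up afterwards.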

0 Answers