Convert raw byte string to Unicode without knowing the codepage beforehand

Question

When my program is launched from the right-click context menu, Windows passes the file path as a raw (byte) string.

For example:

path = 'C:\\MyDir\\\x99\x8c\x85\x8d.mp3'

Many external packages in my application expect unicode strings, so I have to convert the path to unicode.

That would be easy if we knew the raw string's encoding beforehand (in the example, it is cp1255). However, I can't know which encoding will be used locally on each computer around the world.

How can I convert the string into unicode? Perhaps using win32api is needed?

ASCII is a codec that can always be converted to Unicode; just `.decode('ASCII')`. What you mean is not ASCII but encoded bytes. — Martijn Pieters, May 09 '13 at 19:14
You can't in general decode a string to unicode without knowing the encoding. Your text is not ASCII but some unknown encoding. — BrenBarn, May 09 '13 at 19:15
Your example is *not* cp1255, decoding using that codec fails. It is not UTF-16 either, which is surprising as that is what Windows uses internally for filenames. — Martijn Pieters, May 09 '13 at 19:17
Does `import locale; path.decode(locale.getpreferredencoding())` work perhaps? — Martijn Pieters, May 09 '13 at 19:19
Judging by your assertion that this is `cp1255` (Hebrew), I tried out a few other codecs; this is actually `cp856` or `cp862`; both decode the bytes you gave as: `C:\MyDir\שלום.mp3` — Martijn Pieters, May 09 '13 at 19:24
@MartijnPieters That's interesting. `locale.getpreferredencoding()` returns `cp1255` on my system. — iTayb, May 09 '13 at 19:24
@BrenBarn That's right, it's impossible to decode it without the encoding, but windows should know the right encoding. — iTayb, May 09 '13 at 19:26
@iTayb: If you have cp856 files on a cp1255 system, and no out-of-band way of knowing that they're cp856, there's really nothing you can do. I'm guessing Explorer and the DOS prompt don't show them properly either. For special cases (e.g., everything's probably utf-8, cp1255, or cp856), you can write some "guessing code" that tries various heuristics and usually guesses the charset, but that's the best you could do. — abarnert, May 09 '13 at 19:26
@iTayb: How do you expect Windows to know the encoding? If your system is set to cp1255, Windows expects all of your files to be cp1255. Unless… I believe it's possible to configure a non-default codepage for each drive or remote share somehow. Could you have a cp856 C: drive on a cp1255 system or something? — abarnert, May 09 '13 at 19:28
since you don't seem to have a better alternative, maybe use chardet to guess the encoding for you? — cmd, May 09 '13 at 19:28
@abarnert Windows itself shows it properly (through `explorer.exe`), so windows somehow knows how to handle it. — iTayb, May 09 '13 at 19:30
what does `path.decode("mbcs")` (ansi codepage) produce? Note: Windows uses Unicode filenames internally; look for a way to get Unicode strings directly instead of decoding bytes. — jfs, May 09 '13 at 20:08
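The codec observations in the comments above can be checked with a short script. This is a minimal sketch written for Python 3 (where the raw path is a `bytes` object); the helper name `try_codecs` is hypothetical and not part of any library.

```python
# -*- coding: utf-8 -*-
# The raw path from the question, as a bytes literal.
RAW_PATH = b"C:\\MyDir\\\x99\x8c\x85\x8d.mp3"


def try_codecs(raw, codecs):
    """Map each codec name to the decoded text, or None if strict decoding fails.

    Hypothetical helper for illustration only.
    """
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


results = try_codecs(RAW_PATH, ["cp1255", "cp862"])
for name in sorted(results):
    print(name, repr(results[name]))
```

With these bytes, `cp862` yields `C:\MyDir\שלום.mp3`, while strict `cp1255` decoding fails, which matches what the comments report.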

score 3 · Accepted Answer · answered May 09 '13 at 22:54

No idea why you might be getting the DOS code page (862) instead of ANSI (1255) - how is the right-click option set up?

Either way - if you need to accept any arbitrary Unicode character in your arguments you can't do it from Python 2's sys.argv. This list is populated from the bytes returned by the non-Unicode version of the Win32 API (GetCommandLineA), and that encoding is never Unicode-safe.

Many other languages including Java and Ruby are in the same boat; the limitation comes from the Microsoft C runtime's implementations of the C standard library functions. To fix it, one would call the Unicode version (GetCommandLineW) on Windows instead of relying on the cross-platform standard library. Python 3 does this.

In the meantime, for Python 2, you can do it by calling GetCommandLineW yourself, but it's not especially pretty. You can also use CommandLineToArgvW if you want Windows-style parameter splitting. You can do this with the win32 extensions or with plain ctypes.

Example (though the step of encoding the Unicode string back to UTF-8 bytes is best skipped).
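For reference, the ctypes route described above could look roughly like the following. It is a sketch, not a definitive implementation: the function name `get_unicode_argv` is hypothetical, the non-Windows fallback exists only so the snippet can be imported anywhere, and trimming the result to the length of `sys.argv` is a heuristic for dropping the interpreter and its options.

```python
import sys


def get_unicode_argv():
    """Return the command-line arguments as Unicode strings (hypothetical helper)."""
    if sys.platform != "win32":
        # Assumption: outside Windows there is no wide-character API to call,
        # so simply hand back sys.argv.
        return list(sys.argv)

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.windll.kernel32
    shell32 = ctypes.windll.shell32

    # Declare the Win32 signatures explicitly so pointers are not truncated.
    kernel32.GetCommandLineW.argtypes = []
    kernel32.GetCommandLineW.restype = wintypes.LPCWSTR
    shell32.CommandLineToArgvW.argtypes = [wintypes.LPCWSTR,
                                           ctypes.POINTER(ctypes.c_int)]
    shell32.CommandLineToArgvW.restype = ctypes.POINTER(wintypes.LPWSTR)
    kernel32.LocalFree.argtypes = [ctypes.c_void_p]
    kernel32.LocalFree.restype = ctypes.c_void_p

    argc = ctypes.c_int(0)
    argv_ptr = shell32.CommandLineToArgvW(kernel32.GetCommandLineW(),
                                          ctypes.byref(argc))
    if not argv_ptr:
        raise ctypes.WinError()
    try:
        args = [argv_ptr[i] for i in range(argc.value)]
    finally:
        # CommandLineToArgvW allocates one block that the caller must free.
        kernel32.LocalFree(ctypes.cast(argv_ptr, ctypes.c_void_p))

    # Heuristic: the full command line also contains the interpreter and its
    # options, so keep only as many trailing items as sys.argv has.
    start = max(len(args) - len(sys.argv), 0)
    return args[start:]
```

Calling `get_unicode_argv()` at program start and using its result in place of `sys.argv` avoids the lossy byte conversion altogether, so no codepage guessing is needed.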

score 2 · Answer 2 · edited May 23 '17 at 11:46

I usually use my own utility function for safe conversion from common codepages to unicode. To read the OS default encoding, the locale.getpreferredencoding function may help (http://docs.python.org/2/library/locale.html#locale.getpreferredencoding).

Example of a utility function that tries to convert to unicode by iterating over some predefined encodings:

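# coding: utf-8
from locale import getpreferredencoding

def to_unicode(s):
    """Decode a byte string by trying a fixed list of encodings in order."""
    if isinstance(s, unicode):
        return s  # already unicode, nothing to do

    # Try the system's preferred encoding first, then some likely codepages.
    for cp in (getpreferredencoding(), "cp1255", "cp1250"):
        try:
            return unicode(s, cp)
        except UnicodeDecodeError:
            pass
    raise ValueError("Conversion to unicode failed")
    # or fall back to a lossy decode instead of raising:
    # return unicode(s, getpreferredencoding(), "replace")

print(to_unicode("addđšđčćžŽŠĐ"))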

The fallback can be enabled by passing errors="replace" to the unicode function. Reference: http://docs.python.org/2/library/functions.html#unicode
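As an illustration of that fallback, here is a sketch of the same idea with the lossy decode switched on. It uses `bytes.decode`, so it also runs on Python 3; the name `decode_with_fallback` and its `encodings` parameter are hypothetical, and the default encoding list is simply the one from the function above.

```python
import locale


def decode_with_fallback(raw, encodings=None):
    """Try each encoding strictly, then decode lossily with the first one.

    Hypothetical helper; the default encoding list is an assumption taken
    from the answer's example.
    """
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to decode
    if encodings is None:
        encodings = (locale.getpreferredencoding(), "cp1255", "cp1250")
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: undecodable bytes become U+FFFD.
    return raw.decode(encodings[0], "replace")
```

The order of `encodings` matters: a permissive codepage early in the list will accept almost any input, so the most likely encoding should come first.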

For converting back to some codepage you can check this.
