In my case I embedded Python into my application. When the path of my application contains a non-latin-1 character, Py_Initialize calls exit(1) internally (details below).

So I checked whether I can reproduce this with the standard interpreter executable.

Python 2.7.x on Windows doesn't seem to work when the PYTHONHOME path contains a character outside the latin-1 charset: the site module cannot be found and imported. Since umlauts seem to work, what is the actual limitation here? Is only latin-1 supported? And why does it work on OS X then?

C:\Users\ъ\Python27\python.exe    // fails to start (KOI8-R)
         ^
C:\Users\ġ\Python27\python.exe    // fails to start (latin-3)
         ^
C:\Users\ä\Python27\python.exe    // works fine (latin-1)
         ^

Any ideas?

Background:

I haven't stepped through the code yet, but Python 2.6 and Python 2.7 also behave differently when site is not available. Python 2.6 just prints a message; Python 2.7 refuses to start.

static void
initsite(void)
{
    PyObject *m;
    m = PyImport_ImportModule("site");
    if (m == NULL) {
        ...

        // Python 2.7 and later
        exit(1);

        // Python 2.6 and prior
        PyFile_WriteString("'import site' failed; traceback:\n", f);
    }
    ...
}

Python 2.7: https://github.com/enthought/Python-2.7.3/blob/master/Python/pythonrun.c#L725

Python 2.6: https://github.com/python-git/python/blob/master/Python/pythonrun.c#L705
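
One way to check whether the failure is confined to the site import is to start the interpreter with the -S option, which skips the automatic import of site at startup. The following is only a minimal diagnostic sketch, not a confirmed fix: sys.executable is a stand-in for the python.exe under the non-latin-1 path, and whether 2.7 actually gets past startup this way is untested here.

```python
import subprocess
import sys

# Stand-in for the interpreter under test (assumption): replace it with the
# python.exe that lives under the non-latin-1 path.
interpreter = sys.executable

# -S disables the automatic "import site" during interpreter startup.
output = subprocess.check_output([interpreter, "-S", "-c", "print('ok')"])
output = output.decode("ascii").strip()
print(output)
```

For the embedded case, the C-level counterpart in Python 2.7 should be setting the global Py_NoSiteFlag to 1 before calling Py_Initialize.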

HelloWorld
  • Have you tried using Python 3 instead? They redid the Unicode handling, and it's much cleaner. My recommendation is actually to use 3 whenever you can, and 2 only if you have to. – A. L. Flanagan May 25 '16 at 20:45
  • In Python 3 it works (or should), yes. I have to stick with Python 2 because that is the version embedded in our software; this will change in the future, though. – HelloWorld May 25 '16 at 21:27
  • Can you elaborate on how you "embed" Python in your app? Calling it from C/C++? What mechanism do you use? And do you set PYTHONHOME? If so, how do you set it? As a side note, the behaviour of the OS file system with respect to Unicode paths varies quite a bit between Windows, Mac and Linux/POSIX. And dealing with this in CPython 2 needs a bit of fiddling at times, though I did wrestle with it a few times successfully. – Philippe Ombredanne May 26 '16 at 10:15
  • Using *Py_Initialize, ...* from the C API. I tried *PYTHONHOME* and the corresponding C functions (*Py_SetPath*, *Py_SetPythonHome*, ...) with no success. Btw, Python 2.7 (without being embedded) doesn't work either if installed at the given paths. – HelloWorld May 26 '16 at 15:31
  • MS Windows differs from OS X in that the fundamental character set is UTF-16 there. For backward compatibility, it also provides an "ANSI" API, which uses single-byte strings but isn't able to represent the whole Unicode range. I'm pretty sure Python 2 will never be upgraded to use the fully Unicode-capable win32 API, so any hassle is futile unless you at least upgrade to Python 3. – Ulrich Eckhardt May 27 '16 at 06:46

2 Answers

I think the problem is that Python 2 internally processes everything as byte strings in the platform's system encoding, which (in Western Europe) is CP1252, a variant of Latin-1. So there is no surprise that it cannot correctly process a PYTHONHOME path containing other characters.
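
To make that hypothesis concrete, here is a small sketch that tests which of the three directory names from the question can be represented in a single-byte code page. It only illustrates the encoding limitation; it does not prove that this is what makes the site import fail, and the path at the end is a hypothetical example.

```python
from __future__ import print_function


def encodable(text, codec):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The three user-directory names from the question, written as escapes.
names = {"cyrillic": u"\u044a", "latin-3": u"\u0121", "latin-1": u"\u00e4"}

for label, name in sorted(names.items()):
    for codec in ("cp1252", "cp1251"):
        print(label, codec, encodable(name, codec))

# A lossy conversion to a code page that lacks the character substitutes "?",
# so the resulting byte path no longer names the real directory.
lossy = u"C:\\Users\\\u044a\\Python27".encode("cp1252", "replace")
print(lossy)
```

If the system code page were the only factor, the Cyrillic name would be expected to work on a system whose ANSI code page is CP1251, so the result of that check is worth comparing against the observed behaviour.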

But, when I was younger, I was used to the good old 8.3 format of MS-DOS file names...

I can still see such names (and use them) on a Windows 7 box with DIR /X in a console (CMD.EXE) window. This format only uses ASCII uppercase characters and the tilde (~), so it could be used as a workaround: just declare the 8.3 path in the PYTHONHOME environment variable, and start python with that 8.3 path.
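
The 8.3 name can also be looked up programmatically instead of reading it off DIR /X. The following is a minimal sketch using ctypes and the Win32 function GetShortPathNameW; the helper name short_path is made up for illustration, and returning the path unchanged on non-Windows platforms is just a convenience assumption of this sketch. It only helps if 8.3 names are enabled on the volume.

```python
import ctypes
import sys


def short_path(long_path):
    """Return the 8.3 form of long_path on Windows (hypothetical helper)."""
    if sys.platform != "win32":
        # No 8.3 names outside Windows; hand the path back unchanged.
        return long_path
    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint]
    get_short.restype = ctypes.c_uint
    # First call asks for the required buffer size, including the NUL.
    size = get_short(long_path, None, 0)
    if size == 0:
        raise ctypes.WinError()
    buf = ctypes.create_unicode_buffer(size)
    if get_short(long_path, buf, size) == 0:
        raise ctypes.WinError()
    return buf.value
```

The path must be passed as a Unicode string (u"..." in Python 2) and must exist; the returned value could then be used for PYTHONHOME.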

BTW, it is advisable for PYTHONHOME to use a path that contains neither special characters nor spaces. Such a path could work, but it could cause problems with other modules.

Serge Ballesta
  • I would totally agree if it worked on a Russian Windows, because it has the corresponding system codepage (I guess it's CP125**1**). But it fails there as well. – HelloWorld May 19 '16 at 08:29
  • Just for completeness: the console codepage is 866 for a Russian OS – HelloWorld May 19 '16 at 11:19
  • If 8.3 names are missing, check whether they're disabled: `fsutil behavior query Disable8dot3 C:`. Note that enabling 8.3 names will only affect new files subsequently created, not existing files. You could also try using `mklink` to create an ASCII-only hard link, symbolic link, or junction. – Eryk Sun May 20 '16 at 22:56

The Python 2.7 version of the PyImport_ImportModule function is defined like this:

PyObject *
PyImport_ImportModule(const char *name)
{
    PyObject *pname;
    PyObject *result;

    pname = PyString_FromString(name);
    if (pname == NULL)
        return NULL;
    result = PyImport_Import(pname);
    Py_DECREF(pname);
    return result;
}

The Python 3.5 version of PyImport_ImportModule is the same, except that it uses

pname = PyUnicode_FromString(name);

instead of

pname = PyString_FromString(name);

You can look at the code for PyString_FromString and the code for PyUnicode_FromString, but it seems clear that Python 2 builds the module name as a byte string here while Python 3 builds it as a Unicode string. I have not been able to find how or where exactly this leads to the behavior you describe.

The PyImport_Import(module_name) function (version 2.7) only uses module_name like so:

r = PyObject_CallFunction(import, "OOOOi", module_name, globals,
                          globals, silly_list, 0, NULL);

passing on the responsibility...

hkBst
  • Just some background FYI: Python 2 does Unicode, but the Unicode handling was completely redone for Python 3. Python 2 used a "best guess" method of decoding, and if it guessed wrong, all hell would break loose. Python 3 treats Unicode strings as strings, and encoded Unicode as byte arrays, forcing you to explicitly handle conversion if necessary. – A. L. Flanagan May 25 '16 at 20:43
  • I expect the issue to be located somewhere in **PyImport_Import**. I guess the lookup for directories with Unicode characters in their path fails. As mentioned, I haven't debugged it though. At least latin-1 is still supported at this stage. – HelloWorld May 25 '16 at 21:28