The problem.

I'm using a Python 2.7 build system in Sublime Text 3 and have an issue with printed output.
In some cases I get pretty confusing output for '\uFFFD', the 'REPLACEMENT CHARACTER'.


For example:

print u'\ufffd' # should be '�' - the 'REPLACEMENT CHARACTER'
print u'\u0061' # should be 'a'
-----------------------------------------------------
[Finished in 0.1s]

After inversion of the order:

print u'\u0061' 
print u'\ufffd'
-----------------------------------------------------
a
�
[Finished in 0.1s]

So Sublime can print the '�' character, but for some reason it doesn't in the first case.
And it seems quite strange that the output depends on the order of the statements.


The problem with the replacement character makes print output very unpredictable in general.
For example, I want to print decoded bytes with error replacement:

cp1251_bytes = '\xe4\xe0' # 'да' in cp1251 
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
��
[Finished in 0.1s]

Let's replace the bytes:

cp1251_bytes = '\xed\xe5\xf2' # 'нет' in cp1251
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
[Finished in 0.1s]

And add one more print statement:

cp1251_bytes = '\xed\xe5\xf2' # 'нет' in cp1251 
print cp1251_bytes.decode('cp1251') 
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
нет
���
[Finished in 0.1s]
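
As a console-independent check, the decoded strings can be inspected by code point instead of being printed. This is only a minimal sketch, written so that it should run on both Python 2.7 and Python 3; the helper name `code_points` is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Inspect the decoded text by code point, so the result does not depend on
# how the console renders U+FFFD.

def code_points(text):
    """Return the code points of a text string as 'U+XXXX' labels (illustrative helper)."""
    return ['U+%04X' % ord(ch) for ch in text]

da_bytes = b'\xe4\xe0'       # 'да' encoded in cp1251
net_bytes = b'\xed\xe5\xf2'  # 'нет' encoded in cp1251

# Each byte is a UTF-8 lead byte that is not followed by the continuation
# bytes it requires, so each one is replaced by U+FFFD.
da_replaced = da_bytes.decode('utf-8', 'replace')
net_replaced = net_bytes.decode('utf-8', 'replace')

print(code_points(da_replaced))   # two replacement characters
print(code_points(net_replaced))  # three replacement characters
print(code_points(net_bytes.decode('cp1251')))
```

So the first string holds an even number of '\ufffd' chars and the second an odd number, which is the distinction that matters for the patterns in the output.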

Below is an illustration of some other test cases:

[image: animated screenshot of the test runs]


To summarize, the printed output shows the following patterns:

  • it depends on whether the number of '\ufffd' chars in the print statement is even or odd
  • it depends on the order of the print statements
  • it depends on the specific build run

    My questions:

  • Why does this happen?
  • How can I fix the problem?


    My Python 2.7 sublime-build file:

    {   
        "cmd": ["C:\\_Anaconda3\\envs\\python27\\python", "-u", "$file"],
        "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
        "selector": "source.python",
        "env": {"PYTHONIOENCODING": "utf-8"}
    }
    

    With Python 2.7 installed separately from Anaconda the behavior is exactly the same.

  • MaximTitarenko

    2 Answers

    1

    Edit-1 - Using UTF8 with BOM

    It seems the BOM becomes important on Windows, so you need to use a build config like the one below:

    {   
        "cmd": ["F:\\Python27-14\\python", "-u", "$file"],
        "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
        "selector": "source.python",
        "env": {
            "PYTHONIOENCODING": "utf_8_sig"
        }
    }
    

    After that, it works correctly for me on Windows as well.

    [image: build settings]

    [image: correct output]
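
    For context, the difference between Python's `utf_8` and `utf_8_sig` codecs when encoding is that the latter prepends the UTF-8 byte order mark. The sketch below only shows the bytes involved; how Sublime's output panel reacts to the BOM is not something this snippet verifies.

```python
import codecs

text = u'\ufffd'  # the replacement character from the question

plain = text.encode('utf_8')         # just the three bytes of U+FFFD
with_bom = text.encode('utf_8_sig')  # the BOM followed by the same three bytes

print(repr(plain))
print(repr(with_bom))
print(with_bom == codecs.BOM_UTF8 + plain)
```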

    Original Answer

    I checked the issue and could not reproduce it on Python 2.7 with Sublime Text. The only change I had to make was adding # -*- coding: utf-8 -*- to the top of the file, which seems to be the missing part in this question:

    # -*- coding: utf-8 -*-
    
    print u'\u0061' # should be 'a'
    print u'\ufffd' # should be '�' - the 'REPLACEMENT CHARACTER'
    

    After that, reversing the order has no impact:

    [image: print 1]

    [image: print 2]

    You can find more details about this required header in

    Why declare unicode by string in python?

    Below is a summary of the above link:

    When you specify # -*- coding: utf-8 -*-, you're telling Python the source file you've saved is utf-8. The default for Python 2 is ASCII (for Python 3 it's utf-8). This just affects how the interpreter reads the characters in the file.
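
    The effect of the declaration can be seen by compiling the same source bytes under two different declarations. This is a minimal sketch for Python 3, where `compile` accepts bytes and honors the declaration; the helper `run_source` and the variable `s` are made up for illustration.

```python
def run_source(encoding):
    """Compile the same raw bytes under a given coding declaration and return `s`."""
    source = (b"# -*- coding: " + encoding.encode('ascii') + b" -*-\n"
              b"s = '\xe4\xe0'\n")
    namespace = {}
    exec(compile(source, '<example>', 'exec'), namespace)
    return namespace['s']

# The two bytes inside the string literal never change; only the declared
# source encoding does, and with it the characters the interpreter reads.
print(ascii(run_source('cp1251')))   # read as two Cyrillic letters
print(ascii(run_source('latin-1')))  # read as two Latin letters
```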

    Tarun Lalwani
    • Thanks for your contribution, but your assumption is **absolutely** not the reason. I didn't include the encoding declaration in the first examples because I felt its presence was obvious. By the way, if you try to run the 1st example without the encoding declaration, you'll wind up with a `SyntaxError`, because the char `�` in the comment starts with the non-ASCII byte `\xef`. So there is no way to get the results from my examples without the encoding declaration. And finally, you can see that I used `# coding: utf-8` at the top of the gif from my question. – MaximTitarenko Nov 15 '17 at 15:41
    • Will delete this answer and work out what else the issue is. But let's first chat about a few debugging details. Can you download non-Anaconda Python directly from the Python site and try both Python 2.7.X and Python 3.6.X, the latest versions? Can you say whether the issue reproduces across them? What happens when you run the same file in a terminal? – Tarun Lalwani Nov 15 '17 at 15:47
    • By the way - here is the [another example of such a behavior](https://stackoverflow.com/questions/47020241/sublime-text-3-not-printing-correctly) - it seems that the problem is related to a whole group of characters, not just the `\ufffd` – MaximTitarenko Nov 15 '17 at 15:47
    • A few more things. In your build config, can you try `"env": {"PYTHONIOENCODING": "utf8"}`, so `utf8` without the `-`? Not sure if it is going to help, but it's worth a shot. Also, please add another field to the build config named `encoding: ` and try values like `utf8`, `utf-8`, `cp1252` to see if any of them helps – Tarun Lalwani Nov 15 '17 at 16:00
    • With non-Anaconda Python 2.7 and the build `{ "cmd": ["C:\\Python27\\python", "-u", "$file"], ...` the behavior is just the same. If I run the program in a terminal (Ubuntu) it works fine. Python 3.6 + Sublime Text works fine too. – MaximTitarenko Nov 15 '17 at 16:00
    • `"env": {"PYTHONIOENCODING": "utf8"}` changes nothing, as well as adding: `"encoding": "utf8"`. However, `"encoding": "cp1252"` prints out for every replacement character the following: `ï¿½`, which is actually the result of: `'�'.decode('cp1252').encode('utf-8')` – MaximTitarenko Nov 15 '17 at 16:29
    • I think I figured it out. You need to use `"env": {"PYTHONIOENCODING": "utf_8_sig"}`. Please try and let me know if it works – Tarun Lalwani Nov 15 '17 at 18:17
    • With `"utf_8_sig"` it behaves in exactly the opposite way: `u\ufffd` is printed, `u\ufffd\ufffd` is not, etc. It seems that the solution is somewhere nearby. – MaximTitarenko Nov 15 '17 at 18:44
    • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/159108/discussion-between-tarun-lalwani-and-maximtitarenko). – Tarun Lalwani Nov 16 '17 at 04:18
    1

    I've reproduced your problem and I've found a solution that works on my platform anyhow: Remove the -u flag from your cmd build config option.

    I'm not 100% sure why that works, but it seems to be a poor interaction resulting from the console interpreting an unbuffered stream of data containing multi-byte characters. Here's what I've found:

    • The -u option switches Python's output to unbuffered
    • This problem is not at all specific to the replacement character. I've gotten similar behaviour with other characters like "あ" (U+3042).
    • Similar bad results happen with other encodings. Setting "env": {"PYTHONIOENCODING": "utf-16be"} results in print u'\u3042' outputting 0B.

    That last example with the encoding set to UTF-16BE illustrates what I think is going on. The console is receiving one byte at a time because the output is unbuffered. So it receives the 0x30 byte first. The console then determines this is not valid UTF-16BE, decides to fall back to ASCII instead, and thus outputs 0. It of course receives the next byte right after and follows the same logic to output B.
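
    The byte values in that example can be checked directly. A small sketch, written to run on Python 2.7 or Python 3; the byte-at-a-time ASCII reader is a stand-in for the assumed console behaviour, not Sublime's actual code.

```python
encoded = u'\u3042'.encode('utf-16-be')
print(repr(encoded))  # two bytes: 0x30 followed by 0x42

# Hypothetical consumer that treats each incoming byte as ASCII on its own.
as_ascii = ''.join(chr(byte) for byte in bytearray(encoded))
print(as_ascii)
```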

    With the UTF-8 encoding, the console receives bytes that can't possibly be interpreted as ASCII, so I believe the console is doing a slightly better job at properly interpreting the unbuffered stream, but it is still running into the difficulties that your question points out.
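
    To illustrate why a multi-byte character can vanish when a stream is read in pieces, here is a minimal sketch. It assumes a reader that decodes every chunk independently and drops what it cannot decode; I don't know that Sublime's console does exactly this, and the split point is chosen by hand.

```python
import codecs

data = u'\ufffd'.encode('utf-8')  # three bytes: EF BF BD
chunks = [data[:1], data[1:]]     # pretend the stream was split mid-character

# Hypothetical naive reader: decodes each chunk on its own and silently
# drops whatever it cannot decode.
naive = u''.join(chunk.decode('utf-8', 'ignore') for chunk in chunks)

# Incremental decoder: holds on to the incomplete prefix until the rest of
# the character arrives.
decoder = codecs.getincrementaldecoder('utf-8')()
incremental = u''.join(decoder.decode(chunk) for chunk in chunks)

print(repr(naive))        # nothing survives
print(repr(incremental))  # the character is recovered
```

    With buffered output the whole character is more likely to arrive in one chunk, which would be consistent with removing -u making the problem go away.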

    DPenner1