How to print non-BMP Unicode characters in Tkinter (e.g. 𝄫)

Question

Note: Non-BMP characters can be displayed in IDLE as of Python 3.8, which was released some time after I posted this question (so Tkinter might display them now too, since both use Tcl). I plan to edit this after I try out Python 3.9 (after I install an updated version of Xubuntu). I also read that editing these characters in IDLE might not be as straightforward as editing other characters; see the last comment here.

So, today I was making shortcuts for entering certain Unicode characters. All was going well. Then, when I tried characters such as U+1D12B (in my Tkinter program; they wouldn't even go into IDLE), I got a strange, unexpected error and my program started deleting just about everything I had written in the text box. That's not acceptable.

Here's the error: _tkinter.TclError: character U+1d12b is above the range (U+0000-U+FFFF) allowed by Tcl

I realize most of the Unicode characters I had been using have only four hex digits in their code points. For some reason, Tcl doesn't like five.
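The four-versus-five distinction is the boundary of the Basic Multilingual Plane (BMP): code points up to U+FFFF fit in four hex digits, and anything above does not. A minimal sketch for finding the characters in a string that fall outside that range (the helper name `non_bmp_chars` is made up for illustration):

```python
def non_bmp_chars(text):
    """Return the characters in text whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

# U+266D (flat sign) is inside the BMP, U+1D12B (double flat) is not.
sample = "B\u266d and B\U0001D12B"
for ch in non_bmp_chars(sample):
    print("U+{:04X} needs more than four hex digits".format(ord(ch)))
```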

So, is there any way to print these characters in a ScrolledText widget (let alone without messing everything else up)?

UTF-8 is my encoding. I'm using Python 3.4 (so UTF-8 is the default).

I can print these characters just fine with the print function.

Entering the character by some means other than ScrolledText.insert (e.g. with Ctrl-Shift-U, or by using b'\xf0\x9d\x84\xab' in the code) does enter it without that error, but the widget still starts deleting text erratically or adding extra spaces (the character itself gets deleted too, although it reappears at random).

Have you tried encoding surrogate pair code units in UTF-8? It's not ideal, but might work. (U+1D12B -> U+D834 U+DD2B) — behnam, May 07 '14 at 23:31
No. I'm not sure what that is. Sounds interesting, though. How do I do that? Well, I'll try to figure it out, based on what you said, but feel free to say in a line of code, if you like. — Brōtsyorfuzthrāx, May 07 '14 at 23:32
From what I've seen it looks like they've disabled surrogate pairs in Python 3. Am I wrong? — Brōtsyorfuzthrāx, May 07 '14 at 23:57
Tcl currently mainly assumes that every character is in the range U+000000…U+00FFFF. This is wrong; we know. — Donal Fellows, May 08 '14 at 12:36
Try this: `b'\xED\xA0\xB4\xED\xB4\xAB'`. But don't forget that doing this is against the standard (UTF encoding specs). — behnam, May 08 '14 at 23:15
@J.F.Sebastian To get around it, I actually coded my own way of representing them (kind of like what the people in the bug report were wanting, except I used ordinal numbers), whether in Text widgets, the tab bar or the open/save dialogs. So, I can use the characters in my editor. They just don't display as anything but codes unless you open them in another program that supports this range. I'll post an answer with the code. — Brōtsyorfuzthrāx, Jan 21 '15 at 20:26
This requires Tcl 8.7 and Tk 8.7 to get fixed (or an extremely unusual build configuration that's not really supported in earlier versions). The project on this was complicated; see [TIP #389](https://core.tcl-lang.org/tips/doc/trunk/tip/389.md) and [TIP #542](https://core.tcl-lang.org/tips/doc/trunk/tip/542.md) among other key spec documents. — Donal Fellows, Nov 24 '20 at 20:18
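
The surrogate-pair bytes suggested in the comments can be derived by hand. The sketch below splits U+1D12B into its UTF-16 surrogate pair and encodes each half as UTF-8 with Python's `surrogatepass` error handler. As the comment warns, the result is not valid UTF-8, and whether Tk then renders it is not something this snippet can show. The function name `surrogate_pair` is hypothetical.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = surrogate_pair(0x1D12B)
# Encode the two surrogates separately; strict UTF-8 encoding would refuse them.
raw = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")
print(hex(high), hex(low), raw)
```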

Brōtsyorfuzthrāx · Accepted Answer · 2015-01-22T05:36:34.030

There is currently no way to display those characters as they are supposed to look in Tkinter in Python 3.4 (although someone mentioned that surrogate pairs may work in Python 2.x). However, you can implement methods to convert the characters into displayable codes and back, and call them whenever necessary. You have to call them when you print to Text widgets, when you copy and paste, in file dialogs*, in the tab bar, in the status bar, and anywhere else text passes through Tkinter.

*The default Tkinter file dialogs do not allow for much internal engineering of the dialogs. I made my own file dialogs, partly to help with this issue. Let me know if you're interested. Hopefully I'll post the code for them here in the future.

These methods convert out-of-range characters into codes and vice versa. The codes are formatted with ordinal numbers, like this: {119083ū}. The braces and the ū are just to distinguish this as a code. {119083ū} represents U+1D12B (119083 in decimal). As you can see, I haven't yet bothered with a way to escape codes, although I did purposefully try to make the codes very unlikely to occur. The same is true for the ᗍ119083ūᗍ used while converting. Anyway, I mean to add escape sequences eventually. These methods are taken from my class (hence the self). (And yes, I know you don't have to use semicolons in Python. I just like them and find that they make the code more readable in some situations.)

import re;

def convert65536(self, s):
    #Converts a string with out-of-range characters in it into a string with codes in it.
    l=list(s);
    i=0;
    while i<len(l):
        o=ord(l[i]);
        if o>65535:
            l[i]="{"+str(o)+"ū}";
        i+=1;
    return "".join(l);
def parse65536(self, match):
    #This is a regular expression method used for substitutions in convert65536back()
    digits=match.group()[1:-2];
    text=int(digits);
    if 65535<text<=0x10FFFF:
        #A valid code point above U+FFFF: restore the character.
        return chr(text);
    else:
        #Not a code for an out-of-range character. Set it aside with ᗍ markers
        #(keeping the digits exactly as written) so the loop cannot match it again.
        return "ᗍ"+digits+"ūᗍ";
def convert65536back(self, s):
    #Converts a string with codes in it into a string with out-of-range characters in it
    while re.search(r"\{\d\d\d\d\d+ū\}", s) is not None:
        s=re.sub(r"\{\d\d\d\d\d+ū\}", self.parse65536, s);
    #Put the braces back on the codes that parse65536() set aside.
    s=re.sub(r"ᗍ(\d\d\d\d\d+)ūᗍ", r"{\1ū}", s);
    return s;

But your methods don't allow using Label or Button with these Unicode characters, do they? — erik, Aug 27 '16 at 09:58
The methods don't allow out of range characters to be displayed at all, on any widget. What it does allow is for codes that represent them to be displayed so you can handle text with those characters (so you can save that text properly and so it doesn't crash and such). This code is probably most useful in text editors and such (so you can actually open and edit files with out of range characters; it's not as useful for just reading). You could display those codes on button and label widgets, since the codes don't contain characters that are out of range, but that might not be what you want. — Brōtsyorfuzthrāx, Aug 29 '16 at 22:29
The methods just convert the out-of-range characters into codes and vice versa, so you can display codes where the characters aren't possible (e.g. in widgets), and convert them back into characters where the characters are possible (such as in a saved txt file, the clipboard, or non-Tkinter Python code). — Brōtsyorfuzthrāx, Aug 29 '16 at 22:40
How would I implement this in real code? I mean do I have to check each string I put somewhere in the Tkinter-GUI? This would be a mess and cause a lot of try-except-blocks... — buhtz, Feb 09 '18 at 22:08
@buhtz Before you print to a tkinter widget, convert the text to characters that will print in a tkinter widget (with `convert65536`). When you get the text from the widget to manipulate, convert it to the actual characters with `convert65536back`. You don't need to call `parse65536`. — Brōtsyorfuzthrāx, Feb 10 '18 at 05:37
You could extend your widget classes to convert things automatically with their methods. You shouldn't need to deal with exceptions. — Brōtsyorfuzthrāx, Feb 10 '18 at 07:24
@Shule I have thousands of strings (in a `Listbox`) that I would have to check for such characters. There should be a fallback in Tk itself. I don't know the content of my strings and I don't want to waste resources checking all of them. — buhtz, Feb 10 '18 at 07:25
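
One way to do the conversion automatically, as suggested in the comments, is to wrap a widget's `insert` and `get` methods. The mixin below is only a sketch: `CodeConvertingMixin`, `to_codes` and `from_codes` are illustrative names, and it assumes the wrapped widget's `insert(index, chars, *args)` and `get(index1, index2=None)` have the same shape as those of `tkinter.Text`.

```python
import re

def to_codes(s):
    """Turn characters above U+FFFF into {<ordinal>ū} codes."""
    return "".join("{%dū}" % ord(c) if ord(c) > 0xFFFF else c for c in s)

def from_codes(s):
    """Turn {<ordinal>ū} codes back into the characters they stand for."""
    def restore(m):
        n = int(m.group(1))
        # Leave anything that is not a valid code point above U+FFFF untouched.
        return chr(n) if 0xFFFF < n <= 0x10FFFF else m.group()
    return re.sub(r"\{(\d{5,})ū\}", restore, s)

class CodeConvertingMixin:
    """Convert on the way into the widget and back on the way out."""

    def insert(self, index, chars, *args):
        super().insert(index, to_codes(chars), *args)

    def get(self, index1, index2=None):
        return from_codes(super().get(index1, index2))

# Intended use (needs a display, so it is left as a comment):
#     class SafeText(CodeConvertingMixin, tkinter.Text):
#         pass
```

Listing the mixin before the widget class in the bases puts its `insert` and `get` first in the method resolution order, so calling code can stay unchanged.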

score 0 · Answer 2 · answered Feb 10 '18 at 08:55

My answer is based on @Shule's answer but provides more Pythonic and easier-to-read code. It also provides a real use case.

This is the method that populates a tkinter.Listbox with items. There is no back conversion. This solution only takes care of displaying strings that contain characters Tcl does not allow.

import logging
from tkinter import Listbox, END, TclError

_log = logging.getLogger(__name__)

class MyListbox (Listbox):
    # ...
    def populate(self):
        """Replace the listbox contents with the strings in mydata_list
        (a list of strings defined elsewhere).
        """
        def _convert65536(to_convert):
            """Converts a string with out-of-range characters in it into a
            string with codes in it.

            Based on <https://stackoverflow.com/a/28076205/4865723>.
            This is a workaround because Tkinter (Tcl) doesn't allow unicode
            characters outside of a specific range. This could be emoticons
            for example.
            """
            # set() visits each distinct character once; replace() already
            # handles every occurrence of it.
            for character in set(to_convert):
                if ord(character) > 65535:
                    convert_with = '{' + str(ord(character)) + 'ū}'
                    to_convert = to_convert.replace(character, convert_with)
            return to_convert

        # delete all listbox items
        self.delete(0, END)

        # add items to listbox
        for item in mydata_list:
            try:
                self.insert(END, item)
            except TclError as err:
                # Tcl rejected the string: log it and insert the coded form
                _log.warning('{} It will be converted.'.format(err))
                self.insert(END, _convert65536(item))
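
On the concern about checking thousands of strings: a precompiled regular expression can do the check and the replacement in a single pass per string, without waiting for a `TclError` first. This is a sketch only; `_NON_BMP` and `displayable` are made-up names, and no timing claims are made.

```python
import re

# Matches any single character outside the Basic Multilingual Plane.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")

def displayable(text):
    """Return text with every non-BMP character replaced by a {<ordinal>ū} code."""
    return _NON_BMP.sub(lambda m: "{%dū}" % ord(m.group()), text)

items = ["plain", "flat \u266d", "double flat \U0001D12B"]
for item in items:
    # ascii() keeps the output printable on any console encoding.
    print(ascii(displayable(item)))
```

In a method like `populate`, a call such as `self.insert(END, displayable(item))` could then stand in for the `try`/`except` block.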
