
I'm making a command-line interpreter for a programming language, and by the interpreter's nature there are a number of purely cosmetic UTF-8 characters to be printed to the screen.

It's occurred to me that perhaps I should accommodate those whose terminals (line-printers?) don't like/support Unicode, or those whose font doesn't have glyphs for some characters.

The way I thought I'd implement this without rewriting a lot of existing printing code is to add a command-line flag (say, --no-unicode-out), and then do something like the following:

import sys
from unicodedata import normalize

class myStdout(object):
    def __init__(self):
        pass

    def write(self, *args, **kwds):
        return sys.__stdout__.write(
            "".join(" ".join(args).replace("µ", "micro"))
        )

    def flush(self, *args, **kwds):
        return sys.__stdout__.flush()

NO_UNICODE_OUT = bool(len(sys.argv) - 1)

if NO_UNICODE_OUT:
    print("stdout switcheroo")
    sys.stdout = s = myStdout()

print(input("> "))

This feels kinda messy, kinda hacky. Now, that's not always a bad thing, but does this kind of solution make any sense at all, and if not then what's a better solution?


If someone wants to nitpick, by "practical" I mean sensical, efficient, readable, idiomatic, whatever.

cat
  • At first glance this seems reasonable. Even better might be to subclass `_io.TextIOWrapper`, of which `sys.stdout` is an instance. That would reduce much of the "kinda hacky" you feel. – msw Jan 29 '16 at 19:50
  • The `write` method only needs to take a single positional argument, and it can be restricted to accept only strings; `print` will do the rest. – GingerPlusPlus Jan 29 '16 at 20:02
  • I wouldn't discount Unidecode that quickly. I have used it a few times and never had problems. Also, it's worth mentioning that it was actually updated 8 days ago according to https://pypi.python.org/pypi/Unidecode – Danny Dyla Jan 29 '16 at 20:03
  • @msw Okay, post that as an answer with an example and I'll accept it – cat Jan 29 '16 at 20:08
  • @DannyDyla I found out about Unidecode from an SO answer from 2010, and for some reason I couldn't find a newer version. Thanks! – cat Jan 29 '16 at 20:09
  • @msw I find it is usually a bad idea to override system level functionality. Especially since third party libraries and other developers may depend on an exact and well documented behavior. – Danny Dyla Jan 29 '16 at 20:17
  • @DannyDyla overriding system functionality is not *quite* what I'm doing since it only affects what I tell it to, and it's certainly very documentable. Also, note the override *only* takes place *if* the command line switch is given, not for any or all invocations. – cat Jan 29 '16 at 20:34

3 Answers


A lot got read into my one-sentence comment, which only addressed half of the OP's question.

The advantage of subclassing (which I rarely have occasion to even think about) is that it allows a specific method to be overridden while bringing everything else along for the ride. I don't think there's any dispute here.

However, I do agree with the comments that altering a well-known global scope object is a Bad Thing. What I was thinking of was something like (this is only pseudo-code):

class MyConsole(io.TextIOWrapper):
    def __init__(self):
        super().__init__()
        # attach self to the same fd as sys.stdout

    def write(self, message):
        self.fd.write(self._asciify(message))

    def print(self, …):
        # optional convenience method
        print(…, file=self)

if interactive_console:
    output = sys.stdout 
    if ascii_only:
        output = MyConsole()

    output.print(prompt)
    read_eval_print_loop(sys.stdin, output, …)

What I was not advocating was `sys.stdout = anything`, for, as commenters have noted, there is a likelihood approaching 1.0 of unexpected side effects. True, my brief comment did not address this aspect of the OP at all.

I did not look at the unidecode package mentioned elsewhere; it might be perfect for all I know. My sketch may have re-invented that wheel, or the module could be overkill for the task.
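
A runnable version of that sketch might look like the following; the class name, the replacement table, and the testable `buffer` parameter are my own assumptions, not part of msw's pseudo-code:

```python
import io
import sys

class AsciiConsole(io.TextIOWrapper):
    """A text stream that transliterates known glyphs before writing."""

    REPLACEMENTS = {"µ": "micro", "λ": "lambda"}  # hypothetical glyph table

    def __init__(self, buffer=None):
        if buffer is None:
            # attach to the same fd as sys.stdout, without closing it later
            buffer = io.FileIO(sys.stdout.fileno(), "w", closefd=False)
        super().__init__(buffer, encoding="ascii",
                         errors="backslashreplace", line_buffering=True)

    def write(self, message):
        # transliterate the known glyphs, then hand off to the wrapper
        for char, ascii_name in self.REPLACEMENTS.items():
            message = message.replace(char, ascii_name)
        return super().write(message)
```

Anything not covered by the table still comes out as an ASCII escape thanks to `errors="backslashreplace"`, so no character can crash the stream; and because any binary buffer is accepted, the class is easy to unit-test against an `io.BytesIO()`.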

msw

.replace("µ", "micro") is not practical: it doesn't handle any other Unicode character, and it is unmanageable to assume that no code will ever print an unencodable character.

You don't need to change your code if it already prints Unicode (the default): don't hardcode your environment's character encoding inside your script. There are several ways to support Unicode-deficient environments, e.g., set the PYTHONIOENCODING=:backslashreplace environment variable, and/or set sys.displayhook to format output the way IPython does (note: this might create issues with doctest and other similar modules).
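
For instance, the envvar approach can be demonstrated from Python itself. The snippet below forces ascii so the effect is visible even on a UTF-8 terminal; the `:backslashreplace` form with no encoding keeps the locale's encoding and only swaps the error handler:

```python
import os
import subprocess
import sys

# Child interpreter prints the micro sign, but PYTHONIOENCODING forces
# its stdout to ASCII with backslashreplace error handling.
result = subprocess.run(
    [sys.executable, "-c", r'print("\u00b5")'],
    env={**os.environ, "PYTHONIOENCODING": "ascii:backslashreplace"},
    capture_output=True,
    text=True,
)
print(result.stdout)  # \xb5 -- escaped, not a raw micro sign
```

The script being run did not change at all; only the environment did, which is the point.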

Replacing sys.stdout makes sense if you extend the functionality in a way that is independent of the rest of your interpreter (e.g., you shouldn't put logic that knows about your interpreter's prompt in there). The win-unicode-console package is an example where replacing the standard streams may be justified: it can print any Unicode character (though it doesn't fix the display of non-BMP characters in the default Windows console, and naturally the corresponding font has to support the desired characters too).

The actual solution may combine several approaches, depending on which object is best placed to manage the information at a given abstraction level; e.g., look at how IPython implements color printing (pyreadline), and see Which character encoding is the IPython terminal using?

> The question is about cleaning up my own mess if someone's terminal doesn't render what I force feed it.

Even if you only need to support the text that you generate yourself, you shouldn't put .replace("µ", "micro") inside a sys.stdout object. Instead, put .replace("µ", "micro") where you generate µ, i.e., generate micro in the first place.
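
A minimal sketch of that idea, with hypothetical names; the point is that the decision lives in the formatting function, not in the stream:

```python
ASCII_ONLY = False  # imagine this set from a --no-unicode-out flag

def format_micros(n, ascii_only=None):
    """Format a microsecond count, degrading to ASCII when asked to."""
    if ascii_only is None:
        ascii_only = ASCII_ONLY
    unit = "microseconds" if ascii_only else "µs"
    return f"{n} {unit}"

print(format_micros(5))                   # 5 µs
print(format_micros(5, ascii_only=True))  # 5 microseconds
```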

jfs
  • As I noted in the post and a comment, I don't care at all about the rest of Unicode because if someone's going to type characters into my (hobby) project it is not my problem. The question is about cleaning up my own mess if someone's terminal doesn't render what I force feed it. Your answer is a great one for handling all of Unicode. – cat Jan 30 '16 at 09:38
  • @cat: I've updated the answer to address the comment. Unrelated: all information that is necessary to answer the question should be in the question itself, i.e., move the essential parts from comments to your question. – jfs Jan 30 '16 at 09:44

First, the correct way to do this is to use Unidecode, which was updated just 8 days ago at the time of this writing.

Secondly, to answer the question about how idiomatic the code is: I find that overriding system-level functions and objects is a hack best left undone. Introducing too much magic into your code makes it harder to read and thus harder to debug. Imagine another person reading your code who doesn't know about your hack, and the headache they get when the `print` builtin doesn't behave as expected; now imagine that developer is you in the future, not remembering you did it and unable to ask your past self what you did. This violates point 2 of PEP 20 (The Zen of Python): "Explicit is better than implicit."

When I find myself wanting to override system functions I usually put them in a small wrapper like the following:

def _p(obj):
    # some logic on the object
    print(obj)

With this method, the code is immeasurably more readable, and when someone sees the `_p` function they know it must be defined somewhere, since it isn't a builtin.

Danny Dyla
  • Yeah, I'm playing around with `unidecode` and it does basically exactly what the two lines in my `sys.stdout` override do, except more slowly, because I don't really care about translating *all* of unicode, just the glyphs in my program. – cat Jan 29 '16 at 20:29
  • The issue with your "give your hacky, confusing overrides a new identifier" is that now I need to go back and add an `if` clause to *every* writer in my program to test whether we're writing unicode. By redefining `sys.stdout`, I just put one statement at the top of all the files I want affected, and boom: more maintainable for me now and in the future, and it has the bonus of being localised to one toggleable const *and* my override isn't really that complex, especially with a docstring. – cat Jan 29 '16 at 20:31
  • @cat you only have to replace occurrences of `print` with `_p` (a simple find-and-replace) and import it. You put the if statement in the definition of `_p`, so you only have to do it once (that would be the `# some logic on the object` part) and not at each call. Feel free to do what you want, but I've been bitten by this in the past. You asked "is this idiomatic", not "does this work". The answer is that it is not idiomatic. – Danny Dyla Jan 29 '16 at 20:37
  • Watch out! `unidecode` does **not** play nicely with future.builtins.str (python3 str behavior). Check your code to make sure you never pass `bytes` to `unidecode()`: `str(unidecode(str('bites')))` => `"b'bites'"` if you've done `from builtins import str` – hobs Aug 23 '16 at 18:01