
I have a set of strings in unicode.

When I print these out to a file and cat it, this can break my bash terminal: after catting the file, I will get "symbol salad" where everything is just random gibberish (including my command prompt).

I understand that this is probably related to the fact that the strings are unicode strings.


Currently, I am encoding them as ASCII strings as follows: `my_string.encode('ascii', 'ignore')`

However, this discards a lot of data from the strings. Ideally, I would have some way to safely preserve all the Unicode data in a file such that the user's terminal does not break when the file is catted.

What is the proper way to do this?

Chris
  • What operating system are you using? And, if you are using Linux, what terminal emulator are you using? What version of Python are you using? Finally, please provide a short, complete program that demonstrates the problem. (It can be just 3 lines long if you like, but it should be complete and reproducible). See [mcve] for more information. – Robᵩ Dec 11 '17 at 22:22
  • @Robᵩ python 3, mac osx's default terminal, on a linux remote; struggling to sanitize the strings such that everything cats properly. if it helps, I am able to disable whichever series of bits are breaking into the bash by sed replacing all the numbers, or: `cat the_file.txt | sed 's/[0-9]//g'` – Chris Dec 11 '17 at 22:24
  • Please provide some of those sample strings which blow up your terminal session. Did you try to execute file generation and calling `cat` on your local Mac as well? Are you able to access the Linux machine directly with a dedicated keyboard and monitor (resp. not via a terminal session)? – albert Dec 11 '17 at 22:31
  • @albert the actual strings which blow it up are proving very difficult to find. once they start blowing it up, random 'normal' strings re- and dis-engage the effect throughout the file. I have not been able to discover the exact strings yet, and have been spending some time searching. Was hoping this was a known issue. If it isn't I can delete and hold off for a better version of the question in the am. – Chris Dec 11 '17 at 22:34
  • A Unicode-compliant terminal shouldn't have a problem with Unicode strings, if it's actually the same encoding (there's more than one Unicode format, after all!). – Charles Duffy Dec 11 '17 at 22:49
  • (btw, it's not bash that's being broken, but the terminal; when you run `cat somefile.txt`, bash is responsible for actually starting up `/bin/cat`, giving it `somefile.txt` as an argument, and waiting until that process has exited before printing a prompt again, but it doesn't know or care what the output of `cat` was -- it's your terminal emulator that receives and handles that data). – Charles Duffy Dec 11 '17 at 22:51
  • @CharlesDuffy haha, ok--thanks for the lingo :) – Chris Dec 11 '17 at 22:51
  • @CharlesDuffy I am interpreting the incoming data as latin-1 from stdin, then printing strings for which I get an error to a log file. My understanding is that latin-1 should still work. However, it seems as if there is something called `punycode` that I should be using... – Chris Dec 11 '17 at 22:53
  • Is a solution that involves using a specific terminal, or changing that terminal's configuration, acceptable? That is, can you just fix your terminal, or do you need your output to work with the lowest common denominator? – Charles Duffy Dec 11 '17 at 22:55
  • @CharlesDuffy I think it is best if the output--since it is for the purpose of error reporting--is somehow encapsulated, or encoded in some standard. – Chris Dec 11 '17 at 22:58
  • @CharlesDuffy I think I can safely report that `'latin-1'` is a dangerous encoding to use, and that `'utf-8'` solved all problems. Still not exactly sure what happened. `'punycode'` did not work. – Chris Dec 11 '17 at 23:04
  • Encoding content for a latin-1 terminal when your terminal is actually utf-8... yeah, could see that being troublesome. – Charles Duffy Dec 11 '17 at 23:06

1 Answer


If you have a file that is encoded in UTF-8, or in any other encoding that the terminal can't handle, there is not much you can do about it from the file side: the terminal has to be set up to interpret that encoding.
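To illustrate what an encoding mismatch looks like, here is a minimal sketch. The sample string is made up and is not one of the strings from the question. It shows that the same text yields different bytes under different codecs, and that decoding UTF-8 bytes as Latin-1 never raises an error but can produce characters in the C1 control range (U+0080 to U+009F), which some terminals may treat as control codes.

```python
# Hypothetical sample string; any non-ASCII text would do.
text = "café ✓"

utf8_bytes = text.encode("utf-8")
# Latin-1 cannot represent "✓", so it is replaced with "?" here.
latin1_bytes = text.encode("latin-1", errors="replace")

# The same characters map to different byte sequences.
print(utf8_bytes)
print(latin1_bytes)

# Decoding UTF-8 bytes as Latin-1 always succeeds, but yields mojibake.
mojibake = utf8_bytes.decode("latin-1")
print(repr(mojibake))

# Collect any characters that landed in the C1 control range.
c1_controls = [ch for ch in mojibake if 0x80 <= ord(ch) <= 0x9F]
print([hex(ord(ch)) for ch in c1_controls])
```

Whether such characters actually disturb a given terminal depends on the terminal emulator and its configuration.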

You may have some luck with this answer
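If the goal is a log file that keeps the Unicode data but is safer to `cat`, one possible approach is to write the file as UTF-8 and escape control characters before writing. The sketch below is only an assumption about what might be causing the problem (stray control characters such as ESC or Shift Out); the function names `make_terminal_safe` and `write_safe_log` and the sample string are hypothetical, and only the standard library is used.

```python
import unicodedata


def make_terminal_safe(text):
    """Escape control characters; keep all other characters unchanged."""
    out = []
    for ch in text:
        if ch in "\n\t":
            # Keep ordinary newlines and tabs so the file stays readable.
            out.append(ch)
        elif unicodedata.category(ch) == "Cc":
            # "Cc" covers C0/C1 controls such as ESC (\x1b) and SO (\x0e).
            # Replace them with a visible escape like \x1b.
            out.append(ch.encode("unicode_escape").decode("ascii"))
        else:
            out.append(ch)
    return "".join(out)


def write_safe_log(path, lines):
    """Write the sanitized lines to a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8") as fh:
        for line in lines:
            fh.write(make_terminal_safe(line) + "\n")


# Hypothetical sample containing an escape sequence and a Shift Out byte.
sample = "naïve \x1b(0 text \x0e ✓"
safe = make_terminal_safe(sample)
print(safe)
```

Unlike `encode('ascii', 'ignore')`, this keeps the non-ASCII characters and only rewrites control characters into a visible form. It does not help if the terminal itself is configured for a different encoding than UTF-8.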