Consider a Unicode character, such as the zero-width space, that is not on any conventional keyboard and has no visible glyph. Suppose one wants to use Perl to remove this character from a string, or to print the character from Bash on Unix.
This post reviews how one can do these things using the hexadecimal codes of the character's UTF-8 bytes, and then asks: is there a more direct (or more elegant) way to do these things, perhaps using the decimal representation of the character?
The "zero-width space" (U+200B, http://www.unicode-symbol.com/u/200B.html) shows up occasionally in text files.
For instance, on a MacBook Pro, I saved an SMS conversation from Messages.app as a PDF. Then I opened the PDF in Preview, copied everything, and pasted the clipboard into a file z. Then less z showed many instances of <U+200B>, and when I opened the file in vim it showed up as <200b>.
Similarly, "pop directional formatting" (U+202C, http://www.unicode-symbol.com/u/202C.html) shows up when I copy and paste a phone number from the telephone field of Contacts.app.
Often I want to get only the plain text from a string: anything that a human being would actually want to read, including letters in any language (such as French é, Greek β, Arabic, or Chinese) and of course tab, space, and newline, without any other characters.
This is because the other characters can cause problems. Not only are they a distraction in less and vim, but they also seem to cause LaTeX (pdflatex) to throw an error.
One can remove the "zero-width space" as follows:
- go to the url for the character, as cited above
- scroll down to the table titled "Encodings (Unicode characters converter)"
- on the UTF-8 row, find the text "E2 80 8B"
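- by hand, convert this to \xe2\x80\x8b
Then run the following, which prints the cleaned text to standard output and leaves myfile itself unchanged:
perl -p -e 's/\xe2\x80\x8b//g;' myfile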
Using the same approach, one can print the character:
printf '\xe2\x80\x8b'
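Because the character is invisible, it can be reassuring to dump the bytes that printf actually emits. This is a minimal sketch using od; the variable name zwsp_hex is just illustrative. It uses the octal form of the same three bytes, since \x escapes in printf are a shell extension (bash accepts them) while octal escapes are part of POSIX printf.

```shell
# Emit the three UTF-8 bytes of U+200B (octal 342 200 213 = hex E2 80 8B)
# and dump them as hex pairs, stripping od's padding spaces and newline.
zwsp_hex=$(printf '\342\200\213' | od -An -tx1 | tr -d ' \n')

echo "$zwsp_hex"   # e2808b
```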
But on the same row of http://www.unicode-symbol.com/u/200B.html where one obtains the triad of hexadecimal numbers, one also finds that the decimal representation is 14844043. Is there a way to use this decimal representation, or some other approach more direct than pasting together three hexadecimal codes?
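For context, it may help to see where that decimal number comes from. As far as I can tell, it is simply the three UTF-8 bytes E2 80 8B read as one base-256 number, which is different from the code point U+200B itself. The shell arithmetic below is only a sketch to illustrate this; the variable names are made up for the example.

```shell
# The three UTF-8 bytes read as a single base-256 integer.
utf8_decimal=$(( 0xE2 * 65536 + 0x80 * 256 + 0x8B ))
echo "$utf8_decimal"   # 14844043

# Converting the decimal back to hex recovers the same three bytes.
utf8_hex=$(printf '%x' "$utf8_decimal")
echo "$utf8_hex"       # e2808b

# The code point U+200B is a different, much smaller number.
code_point=$(( 0x200B ))
echo "$code_point"     # 8203
```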