270

Here is some code that I found on the Internet:

class M‮{public static void main(String[]a‭){System.out.print(new char[]
{'H','e','l','l','o',' ','W','o','r','l','d','!'});}}    

This code prints Hello World! onto the screen; you can see it run here. I can clearly see public static void main written, but it is backwards. How does this code work? How does this even compile?

Edit: I tried this code in IntelliJ, and it works fine. However, for some reason it doesn't work in Notepad++ or in cmd. I still haven't found a solution to that, so if anyone does, comment down below.

dumbPotato21

5 Answers

256

There are invisible characters here that alter how the code is displayed. In IntelliJ you can find them by copy-pasting the code into an empty string literal (""), which replaces them with Unicode escapes, removing their effects and revealing the order the compiler sees.

Here is the output of that copy-paste:

"class M\u202E{public static void main(String[]a\u202D){System.out.print(new char[]\n"+
        "{'H','e','l','l','o',' ','W','o','r','l','d','!'});}}   "

The source code characters are stored in this order, and the compiler treats them as being in this order, but they're displayed differently.
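If you don't have IntelliJ at hand, the same view can be produced with a few lines of Java. This is only a minimal sketch: the class name `BidiEscape` and the method `escape` are made up for illustration, and the input line is retyped here with the two overrides written as escapes rather than read from the original file.

```java
final class BidiEscape {
    /** Returns s with every character outside printable ASCII written as a backslash-u escape. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i); // UTF-16 code unit, visited in stored (logical) order
            if (c >= 0x20 && c < 0x7F) {
                out.append(c); // printable ASCII is copied unchanged
            } else {
                out.append(String.format("\\u%04X", (int) c)); // everything else is made visible
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First line of the snippet; U+202E and U+202D are written as escapes here
        String line = "class M\u202E{public static void main(String[]a\u202D){System.out.print(new char[]";
        System.out.println(escape(line));
    }
}
```

Because the loop walks the string in storage order and never consults a display routine, the output shows the characters in the order the compiler reads them.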

Note the \u202E character, which is a right-to-left override, starting a block where all characters are forced to be displayed right-to-left, and the \u202D, which is a left-to-right override, starting a nested block where all characters are forced into left-to-right order, overriding the first override.

Ergo, when it displays the original code, class M is displayed normally, but the \u202E reverses the display order of everything from there to the \u202D, which reverses everything again. (Formally, everything from the \u202D to the line terminator gets reversed twice, once due to the \u202D and once with the rest of the text reversed due to the \u202E, which is why this text shows up in the middle of the line instead of the end.) The next line's directionality is handled independently of the first's due to the line terminator, so {'H','e','l','l','o',' ','W','o','r','l','d','!'});}} is displayed normally.
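The double reversal can be imitated with a heavily simplified model. The sketch below is not the real bidirectional algorithm: it assumes a left-to-right base direction, only understands U+202E and U+202D, never closes an override, and ignores bracket mirroring and implicit character directions. The names `OverrideDisplaySketch` and `visualOrder` are hypothetical.

```java
final class OverrideDisplaySketch {
    static final char RLO = '\u202E';
    static final char LRO = '\u202D';

    /** Visual order of one line, taking only nested RLO/LRO overrides into account. */
    static String visualOrder(String line) {
        char[] chars = new char[line.length()];
        int[] levels = new int[line.length()];
        int n = 0;
        int level = 0; // assumed base direction: left-to-right
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == RLO) {
                level = (level + 1) | 1;  // next higher odd level (right-to-left)
            } else if (c == LRO) {
                level = (level + 2) & ~1; // next higher even level (left-to-right)
            } else {
                chars[n] = c;             // the override characters themselves are not displayed
                levels[n] = level;
                n++;
            }
        }
        int max = 0;
        for (int i = 0; i < n; i++) {
            max = Math.max(max, levels[i]);
        }
        // From the highest level down to 1, reverse every maximal run at that level or higher
        for (int lv = max; lv >= 1; lv--) {
            int i = 0;
            while (i < n) {
                if (levels[i] < lv) {
                    i++;
                    continue;
                }
                int j = i;
                while (j < n && levels[j] >= lv) {
                    j++;
                }
                reverse(chars, levels, i, j - 1);
                i = j;
            }
        }
        return new String(chars, 0, n);
    }

    private static void reverse(char[] chars, int[] levels, int from, int to) {
        while (from < to) {
            char c = chars[from]; chars[from] = chars[to]; chars[to] = c;
            int l = levels[from]; levels[from] = levels[to]; levels[to] = l;
            from++;
            to--;
        }
    }

    public static void main(String[] args) {
        String line = "class M\u202E{public static void main(String[]a\u202D){System.out.print(new char[]";
        System.out.println(visualOrder(line));
    }
}
```

Under these assumptions the program prints `class M){System.out.print(new char[]a][gnirtS(niam diov citats cilbup{`: the part after the U+202D is reversed twice and therefore reads normally, but it lands in the middle of the line. A real renderer would also mirror the brackets in the right-to-left part, which this sketch does not attempt.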

For the full (extremely complex, dozens of pages long) Unicode bidirectional algorithm, see Unicode Standard Annex #9.

Davis Broda
  • 1
    You do not explain what the compiler (as opposed to the display routine) does with those Unicode characters themselves. It might ignore them outright (or treat them as white-space), or it might interpret them as actually contributing to the source code. I don't know the Java rules here, but the fact that they are placed at the end of otherwise unused identifiers suggests to me that it might be the latter, and the Unicode characters are in fact part of those identifier names. – Marc van Leeuwen May 13 '17 at 05:01
  • Would this work the same way in c#, out of interest? – IanF1 May 13 '17 at 08:03
  • 14
    @IanF1 It would work in any language where the compiler / interpreter counts RTL and LTR characters as whitespace. But _never do this_ in production code if you at all value the sanity of the next person to touch your code, which could well be you. – wizzwizz4 May 13 '17 at 09:11
  • 3
    Or, in other words: ["Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live."](http://wiki.c2.com/?CodeForTheMaintainer), @IanF1. Or perhaps: "Always code as if the person who ends up maintaining your code will name-and-shame you as the original author on Stack Overflow." – Cody Gray - on strike May 14 '17 at 11:37
44

It looks different because of the Unicode Bidirectional Algorithm. There are two invisible characters, RLO and LRO, that the Unicode Bidirectional Algorithm uses to change the visual appearance of the characters nested between these two metacharacters.

The result is that visually they look in reverse order, but the actual characters in memory are not reversed. You can analyse the results here. The Java compiler will ignore RLO and LRO, and treat them as whitespace which is why the code compiles.

Note 1: This algorithm is used by text editors and browsers to display LTR characters (English) and RTL characters (e.g. Arabic, Hebrew) together at the same time - hence "bi"-directional. You can read more about the Bidirectional Algorithm at Unicode's website.
Note 2: The exact behaviour of LRO and RLO is defined in Section 2.2 of the Algorithm.

James Lawson
  • What is the purpose of such a capability? – Eugene Sh. May 12 '17 at 18:06
  • 7
    These characters are needed sometimes to visually render Arabic and Hebrew correctly. These languages are read and written *right-to-left* (RTL), the first character that is read/written appears on the *right-hand side*. You can read more [here](https://www.w3.org/International/articles/inline-bidi-markup/uba-basics). – James Lawson May 12 '17 at 18:15
  • Arabic and Hebrew characters are intrinsically RTL, though - they'll appear RTL even without an explicit override, and they'll even automatically reverse the ordering of certain other characters nearby, I think mostly punctuation - so explicit overrides are rarely necessary. – user2357112 May 12 '17 at 18:47
  • This page [here](https://www.w3.org/International/articles/inline-bidi-markup/#oppositedirection) describes when the overrides are necessary. @user2357112 is right, they're rarely needed. Indeed when you have punctuation, quotations and numbers - these special characters are considered "neutral". For a computer that can't read the words and understand the context, it's unclear whether to treat them as LTR or RTL, but the bidi algorithm has to pick *some* ordering. Sometimes it "gets it wrong" and you need to use these override characters to "correct it". – James Lawson May 12 '17 at 18:56
  • 4
    Also, U+202E and U+202D are not considered whitespace. Java only considers [ASCII space, horizontal tab, form feed, and CR/LF/CRLF as whitespace](https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.6). They're actually lexically part of the identifiers `M\u202E` and `a\u202D`, but those identifiers appear to be treated as equivalent to `M` and `a`. (The JLS doesn't do a good job of explaining this.) – user2357112 May 12 '17 at 20:40
  • @JamesLawson: Am I the only person who thinks intermixing such presentation-level issues in a character set is crazy? Text layout algorithms need to be able to identify "words" which can be laid out in context-free fashion, but so far as I can tell, Unicode may require an application to scan many characters before and after a piece of text to determine whether it needs to be split into two words. – supercat May 12 '17 at 21:15
  • @supercat no you're not the only one :D. IMO these override characters should be used as a last resort. For example, in HTML5 you can use `` instead to accomplish the same thing as RLO. You're right, the bidi algorithm uses a lot of scanning and processing of the characters to work out the layout. Most operating systems and browsers use a fast C/C++ library called [ICU Bidi](http://userguide.icu-project.org/transforms/bidi) to do all this work. Google Chrome has its own implementation called [RenderText](https://bugs.chromium.org/p/chromium/issues/detail?id=451799#c8). – James Lawson May 12 '17 at 21:50
  • @JamesLawson: Actually, the explicit override characters aren't really the problem, except insofar as they should be recognized as semantic rather than presentational. Most texts that include Hebrew characters have them in right-to-left order, but some have them in reversed order so that, at least in the absence of line breaks, they'll render properly when shown left to right. Knowing which kind of characters appear in a given string is useful. I would suggest that for many purposes, a "programmer's text editor" should show everything in either LTR order, or everything in RTL, but... – supercat May 12 '17 at 21:59
  • @supercat: Unicode bidirectionality doesn't affect word breaking. As for presentation-level issues, it's hard to say. Spaces and line breaks are just as much of a presentation-level issue as bidirectionality, but we're so used to those being part of ASCII that it hardly even occurs to us. (Line breaks weren't always characters!) We don't have a separate Unipresentation standard for presentation-level issues, so making bidirectionality a part of Unicode at least gives us standard semantics and standard tools for dealing with it. – user2357112 May 12 '17 at 21:59
  • ...hilight in some fashion things whose presentation order doesn't match their logical order. If one regards a "word" as a sequence of characters that can be shown consecutively using their respective escapements, bidirectionality can create word breaks since a character sequence like `12+34` would be a single word if it appears in LTR text, or three words if it appears in RTL text. – supercat May 12 '17 at 22:03
  • @supercat: "can be shown consecutively using their respective escapements" sounds like it would apply to any arbitrary LTR text, such as this entire paragraph, so it doesn't seem like a useful concept of a "word". If you meant something else, I don't think whatever you meant would actually change the word count based on bidirectionality. (We're verging into "comments are not for extended discussion" territory, though.) – user2357112 May 12 '17 at 22:09
  • @user2357112: A single-direction paragraph may be flowed by subdividing it into words, rendering each word as a box, and then arranging the boxes. The code to render the text within a box would need to know about things like kerning, but the layout code could be agnostic as to the contents of the boxes. Testing with `x 12+34 א 12+34` it seems I misremembered the exact rules, but replace the + with an ampersand and the point will become clearer. Taking `x 12&34 y 12&34` and replacing the y with an alef yields `x 12&34 א 12&34`. Whether the characters `12&34` are shown consecutively... – supercat May 12 '17 at 22:17
  • ...depends upon the presence of the preceding alef character. If marker characters were required in all texts that would not be rendered in a uniform direction, a layout function would not need to know any rules beyond how to recognize those markers. As it is, I know of no practical way to code a layout engine in something like JavaScript without having to hard-code a whole bunch of Unicode directionality rules. – supercat May 12 '17 at 22:21
  • @supercat: The bidirectionality doesn't affect the division into boxes there, though. It changes the order in which the boxes are laid out and the order of characters within a box, but it doesn't change the division into boxes. `12&34` is still one box in the second version, even if the glyphs within the box are displayed in a new order. The algorithm won't be any more eager than before to insert a line break inside the box, for example. – user2357112 May 12 '17 at 22:37
  • @supercat sometimes you need to embed LTR text in a RTL sentence, which is put in a quote in a long LTR paragraph. The computer can't help you in those deep nested cases. Or even with simple cases [like](https://www.w3.org/International/articles/inline-bidi-markup/#oppositedirection) in [these](https://en.wikipedia.org/wiki/Left-to-right_mark#Example_of_use_in_HTML) [examples](https://en.wikipedia.org/wiki/Right-to-left_mark#Example_of_use_in_HTML) you also need to override the LTR/RTL setting – phuclv May 13 '17 at 02:22
  • @LưuVĩnhPhúc: Embedding RTL text in an LTR sentence or vice versa should entail the insertion of markers to switch direction, which should be accommodated at the application level when the text is inserted. Further, there should be a standard function to take a piece of text and rearrange all the characters into the order they should be placed from left to right or right to left, and then there should be a means of displaying the characters in that order. Given such a function, a function to e.g. wrap text around a curve could easily accommodate a mixture of LTR and RTL scripts. – supercat May 13 '17 at 15:41
  • @LưuVĩnhPhúc: As it is, I know of no practical way by which a function to wrap text around a curve could easily figure out where any particular character should go, or which character should follow another, unless it contained a huge amount of hard-coded logic related to Unicode character directions. – supercat May 13 '17 at 15:43
30

The character U+202E mirrors the code from right to left, which is very clever. It is hidden right after the M:

"class M\u202E{..."

How did I find the magic behind this?

Well, at first when I saw the question I thought, "it's a kind of joke, to waste somebody else's time", but then I opened my IDE (IntelliJ), created a class, and pasted the code... and it compiled!!! So I took a closer look and saw that the "public static void" was backward, so I went there with the cursor and erased a few chars... And what happened? The chars started erasing backward, so I thought, mmm... strange... I have to execute it... So I proceeded to execute the program, but first I needed to save it... and that was when I found it! I couldn't save the file because my IDE said that some char had a different encoding, and pointed me to where it was. So I started searching Google for special chars that could do the job, and that's it :)

A little about the Unicode Bidirectional Algorithm and the U+202E character involved, briefly explained:

The Unicode Standard prescribes a memory representation order known as logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has a uniform horizontal direction, then the ordering of the display text is unambiguous.

However, because these right-to-left scripts use digits that are written from left to right, the text is actually bi-directional: a mixture of right-to-left and left-to-right text. In addition to digits, embedded words from English and other scripts are also written from left to right, also producing bidirectional text. Without a clear specification, ambiguities can arise in determining the ordering of the displayed characters when the horizontal direction of the text is not uniform.

This annex describes the algorithm used to determine the directionality for bidirectional Unicode text. The algorithm extends the implicit model currently employed by a number of existing implementations and adds explicit formatting characters for special circumstances. In most cases, there is no need to include additional information with the text to obtain correct display ordering.

However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text. To deal with these cases, a minimal set of directional formatting characters is defined to control the ordering of characters when rendered. This allows exact control of the display ordering for legible interchange and ensures that plain text used for simple items like filenames or labels can always be correctly ordered for display.

Why create an algorithm like this?

So that a sequence of Arabic or Hebrew characters can be rendered one after the other from right to left.

developer_hatch
5

Chapter 3 of the language specification provides an explanation by describing in detail how the lexical translation is done for a Java program. What matters most for the question:

Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.

So a program is written in Unicode characters, and the author can escape them using \uxxxx in case the file encoding does not support the Unicode character, in which case it is translated to the appropriate character. One of the Unicode characters present in this case is \u202E. It is not visually shown in the snippet, but if you try switching the encoding of the browser, the hidden characters may appear.

Therefore, the lexical translation results in the class declaration:

class M\u202E{

which means that the class identifier is M\u202E. The specification considers this a valid identifier:

Identifier:
    IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
    JavaLetter {JavaLetterOrDigit}

A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
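The relevant properties of the two override characters can be checked directly with `java.lang.Character`. This is a small sketch rather than a statement about javac internals; the class `IgnorableIdentifierCheck` and its `describe` helper are hypothetical names.

```java
final class IgnorableIdentifierCheck {
    /** Summarises how java.lang.Character classifies the given code point. */
    static String describe(int cp) {
        return String.format(
                "U+%04X whitespace=%b identifierPart=%b identifierStart=%b ignorable=%b",
                cp,
                Character.isWhitespace(cp),
                Character.isJavaIdentifierPart(cp),
                Character.isJavaIdentifierStart(cp),
                Character.isIdentifierIgnorable(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x202E)); // RIGHT-TO-LEFT OVERRIDE
        System.out.println(describe(0x202D)); // LEFT-TO-RIGHT OVERRIDE
    }
}
```

Both characters are reported as identifier parts and as ignorable, but not as whitespace and not as identifier starts. That fits the grammar quoted above: they may follow `M` or `a` inside an identifier, and being "ignorable" would plausibly explain why the identifier still behaves like plain `M`.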

M A
  • Sorry but this is backward (pun intended). There are no escapes in the source code; you are describing how it could have been written. And, it compiles to a class named "M" (just one character). – Tom Blodget Oct 12 '17 at 23:25
  • @TomBlodget Indeed, but the point (which in fact I highlighted in the spec quote) is that the compiler can also process raw Unicode characters. That's really the whole explanation. The escape translation is just additional info and not directly related to this case. As for the compiled class, I think it's because the RTL switch character is somehow being discarded by the compiler. I will try to see if this is expected, but I think it happens after the lexical translation phase. – M A Oct 13 '17 at 05:36
0

This is actually because of Unicode bidirectional support.

U+202E RIGHT TO LEFT OVERRIDE
U+202D LEFT TO RIGHT OVERRIDE

So, those are some tricky characters. They are actually defined for right-to-left language support. The real code is

class M<U+202E>{public static void main(String[]a<U+202D>){System.out.print(new char[]
    {'H','e','l','l','o',' ','W','o','r','l','d','!'});}}

(got this by pasting into cmd.exe). Hope this answer helps you find out how this works.
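If you would rather not depend on how a particular terminal renders these characters, a short scan can locate them. The sketch below uses `Character.getDirectionality`, which reports the two overrides through dedicated constants; the class `BidiControlScanner` and the method `findOverrides` are illustrative names, and the sample string is retyped with escapes.

```java
import java.util.ArrayList;
import java.util.List;

final class BidiControlScanner {
    /** Returns "index: U+XXXX" for each explicit override character found in the text. */
    static List<String> findOverrides(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            byte dir = Character.getDirectionality(c);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE
                    || dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT_OVERRIDE) {
                hits.add(String.format("%d: U+%04X", i, (int) c));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String line = "class M\u202E{public static void main(String[]a\u202D){System.out.print(new char[]";
        for (String hit : findOverrides(line)) {
            System.out.println(hit);
        }
    }
}
```

The check only covers the two override characters discussed here; other directional formatting characters would need their own directionality constants added to the condition.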