
The Java Language Specification states that the escapes inside strings are the "normal" C ones like \n and \t, but it also specifies octal escapes from \0 to \377. Specifically, the JLS states:

OctalEscape:
    \ OctalDigit
    \ OctalDigit OctalDigit
    \ ZeroToThree OctalDigit OctalDigit

OctalDigit: one of
    0 1 2 3 4 5 6 7

ZeroToThree: one of
    0 1 2 3

meaning that something like \4715 is not a valid single escape, despite the value being within the range of a Java character (since Java characters are not bytes).
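As a minimal sketch of how that grammar plays out in practice (the class and constant names are made up for illustration): because 4 is not in ZeroToThree, a string literal containing \4715 should not be read as one character. The lexer takes the two-digit escape \47 and treats the remaining digits as ordinary text.

```java
final class OctalEscapeDemo {
    // Largest value a single octal escape can express: octal 377 = decimal 255.
    static final String MAX_OCTAL = "\377";

    // '4' is not in ZeroToThree, so only the two-digit escape \47 (an apostrophe)
    // is consumed; the trailing "15" is ordinary text.
    static final String OVERLONG = "\4715";

    public static void main(String[] args) {
        System.out.println(MAX_OCTAL.length());        // 1
        System.out.println((int) MAX_OCTAL.charAt(0)); // 255
        System.out.println(OVERLONG.length());         // 3
        System.out.println(OVERLONG);                  // '15
    }
}
```

So the restriction shows up as a silent change of meaning in a string literal rather than as a compile error, which is arguably worse.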

Why does Java have this arbitrary restriction? How are you meant to specify octal codes for characters beyond 255?

paxdiablo
  • 1
    255 is the basic ASCII limit if I'm not mistaken, so you have one for every single base ASCII character. Shouldn't you be happy with that much? The reason you can't go up to, say \4715 is simply because it's over 255 which is standard ASCII limit =D (I'm bad at explaining, refer to answerer) –  Mar 03 '12 at 03:47
  • 1
    @Shingetsu: the ASCII limit is 127, not 255. _Bytes_ are limited to 255, unless you're talking about Java bytes which are, for some bizarre reason, signed :-) But Java characters are not bytes. – paxdiablo Mar 03 '12 at 04:30
  • [See also](http://stackoverflow.com/questions/3537706/howto-unescape-a-java-string-literal-in-java/4298836) – Drew Stephens Apr 02 '14 at 01:11

4 Answers


It is probably for purely historical reasons that Java supports octal escape sequences at all. These escape sequences originated [1] in C, in the days when computers like the PDP-7 ruled the Earth and much programming was done in assembly or directly in machine code. Octal was the preferred number base for writing instruction codes, and there was no Unicode, just ASCII, so three octal digits were sufficient to represent the entire character set.

By the time Unicode and Java came along, octal had pretty much given way to hexadecimal as the preferred number base when decimal just wouldn't do. So Java has its \u escape sequence that takes hexadecimal digits. The octal escape sequence was probably supported just to make C programmers comfortable, and to make it easy to copy'n'paste string constants from C programs into Java programs.

Check out these links for historical trivia:

http://en.wikipedia.org/wiki/Octal#In_computers
http://en.wikipedia.org/wiki/PDP-11_architecture#Memory_management


  1. C's immediate predecessors, BCPL and B, used * instead of \ to introduce string escape sequences. However, neither of those languages had octal escape sequences documented in the manuals linked.
rob mayoff
  • 1
    +1 Also note, even aside from writing instruction codes, octal is much easier than hex when you're working on (for example) an architecture with 36-bit words and 9-bit characters -- 12 octal digits exactly displays one machine word, with 3 digits for each character. If you represent that same 36-bit word with 9 hex digits, you can't easily tell the value of individual chars. – David Gelhar Mar 03 '12 at 05:24
  • 1
    As my answer below explains, the \uXXXX and the octal escape sequences are parsed at very different stages. A \uXXXX escape sequence is NOT an extended version of C's octal escape sequence. Just put an \u000A in a string, and your program will stop compiling. – Sven Aug 19 '13 at 05:41

If I can understand the rules (please correct me if I am wrong):

\ OctalDigit
Examples:
    \0, \1, \2, \3, \4, \5, \6, \7

\ OctalDigit OctalDigit
Examples:
    \00, \07, \17, \27, \37, \47, \57, \67, \77

\ ZeroToThree OctalDigit OctalDigit
Examples:
    \000, \177, \277, \367,\377

\t, \n, \\ do not fall under OctalEscape rules; they must be under separate escape character rules.

Decimal 255 is equal to Octal 377 (use Windows Calculator in scientific mode to confirm)

Hence a three-digit Octal value falls in the range of \000 (0) to \377 (255)

Therefore, \4715 is not a valid octal escape, because it breaks the three-digit limit. If you want the character whose code point is decimal 4715, use the Unicode escape \u, which takes hexadecimal digits: \u126B (4715 in decimal form), since every Java char is a UTF-16 code unit. (If 4715 is read as octal, as in the question, the value is decimal 2509, which is \u09CD.)
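For values above \377 that you only know in octal, one possible approach is to let Java do the base conversion. The sketch below is illustrative, not something the JLS prescribes: `fromOctal` and `toUnicodeEscape` are hypothetical helper names, and the cast relies on Java's octal integer literals (a leading 0), which are not limited to 0377.

```java
final class OctalToChar {
    // Hypothetical helper: parse octal digits and return the matching char.
    static char fromOctal(String octalDigits) {
        int value = Integer.parseInt(octalDigits, 8);
        if (value < 0 || value > Character.MAX_VALUE) {
            throw new IllegalArgumentException("does not fit in a char: " + octalDigits);
        }
        return (char) value;
    }

    // Hypothetical helper: render a char as the text of a Unicode escape
    // (a backslash, the letter u, then four hex digits).
    static String toUnicodeEscape(char c) {
        return String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        char viaCast = (char) 04715;       // octal integer literal, cast to char
        char viaParse = fromOctal("4715"); // same value, parsed at run time
        System.out.println(viaCast == viaParse);       // true
        System.out.println((int) viaParse);            // 2509
        System.out.println(toUnicodeEscape(viaParse)); // the escape text for 09CD
    }
}
```

The cast form is a compile-time constant, so it can be used anywhere a char constant is expected. The helper is only useful for finding the hexadecimal digits to paste into a Unicode escape.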

from http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html:

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
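The surrogate-pair representation described in the quoted documentation can be checked with a short sketch. The class and method names are assumptions for illustration. The `Character` methods are the standard ones, and U+1F600 is just an arbitrary supplementary code point.

```java
final class SurrogateDemo {
    // U+1F600 lies above U+FFFF, so it is a supplementary character.
    static final int SUPPLEMENTARY = 0x1F600;

    // Returns the UTF-16 code units Java uses for the given code point.
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = utf16Units(SUPPLEMENTARY);
        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true
        System.out.println(
            Character.toCodePoint(units[0], units[1]) == SUPPLEMENTARY); // true
        System.out.println(utf16Units(0x126B).length);           // 1 (BMP character)
    }
}
```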

Edited:

Anything beyond the 8-bit range (larger than one byte) for octal values is language-specific. Some programming languages extend octal escapes to match their Unicode implementation; some do not (they limit them to one byte). Java definitely does not allow larger values, even though it has Unicode support.

A few programming languages (vendor-dependent) that limit octal escapes in string literals to one byte:

  1. Java (all vendors) - octal escapes \0 to \7, \00 to \77, \000 to \377 in string and character literals. (Octal integer constants, which begin with 0, are not limited to 0377.)
  2. C/C++ (Microsoft) - octal integer constants begin with 0; octal string literal format \nnn
  3. Ruby - octal integer constants begin with 0; octal string literal format \nnn

A few programming languages (vendor-dependent) that support larger-than-one-byte octal literals:

  1. Perl - An octal integer constant that begins with 0; octal string literal format \nnn See http://search.cpan.org/~jesse/perl-5.12.1/pod/perlrebackslash.pod#Octal_escapes

A few programming languages do not support octal literals:

  1. C# - use Convert.ToInt32(integer, 8) for base-8. See "How can we convert binary number into its octal number using c#?"
ecle
  • Yes, I _know_ the limits. My question is not what the limits are, but rather _why_ those limits are there at all, given that Java characters are not limited to the range 0-255. I'll clarify the question. – paxdiablo Mar 03 '12 at 04:32
  • Of course, Java is using Unicode 16-bit wide for `String` and `char`. But now, you are using escape `\ ` symbol and you use it to represent an octal value which only allows up to `\377` in Java octal escape format or 255 in decimal value. Java octal escape format `\4715 ` is not a valid octal escape format because it is more than three digits according to OctalEscape rules in JLS. – ecle Mar 03 '12 at 04:33
  • If you want to access more than 255 code points under Unicode UTF-16 String/char, use Unicode symbol `\u `. So, for code point 4715(?) is `\u4715` (the correct form, I think it should be `\u126B` for decimal 4715) – ecle Mar 03 '12 at 04:41
  • 1
    @eee I think you're really missing the point of the question. pax is certainly capable of figuring out the right hex escape to use for a given code point; his question is: "why, when Java defines 16-bit characters, does it also define an octal escape syntax for characters that stops at 8 bits?" – David Gelhar Mar 03 '12 at 05:11
  • @DavidGelhar It does as JLS has stated the OctalEscape rules where we need to use `\ ` symbol from `\0` up to `\377` that is from decimal 0 to 255. This is in line with C octal escape rule to represent decimal 0 to 255 only (8-bit range). Even though Java can address in 16-bit for char type, octal values never go beyond decimal 255 limit. However C `char` type can be a vendor-specific implementation and it may not be the common 8-bit type. In C++11, a new `char16_t` is introduced to represent UTF-16 characters. A `wchar_t` type may represent a wide character. – ecle Mar 03 '12 at 05:28
  • @DavidGelhar Anyway, I will check if C can allow beyond `\377` limit for octal representation to address beyond 8-bit. But, definitely Java does not. – ecle Mar 03 '12 at 05:32
  • "use Windows Calculator in scientific mode to confirm" made me LOL. So cute :) (And actually it's programmer mode, not scientific, but who cares) – Franz D. Jun 14 '21 at 12:09

I know of no reason why octal escapes are restricted to Unicode code points 0 to 255. This might be for historical reasons. The question will basically remain unanswered, as there was no technical reason not to increase the range of the octal escapes during the design of Java.

It should be noted, however, that there is a not-so-obvious difference between the Unicode escapes and the octal escapes. The octal escapes are processed only as part of string and character literals, while the Unicode escapes can occur anywhere in a file, for example as part of the name of a class. Also note that the following example will not even compile:

String a = "\u000A";

The reason is that \u000A is expanded to a newline at a very early stage (basically when loading the file), so the compiler sees a string literal that is broken across two lines. The following code does not generate an error:

String a = "\012";

The \012 is expanded after the compiler has parsed the code. This also holds for the other escapes like \n, \r, \t, etc.
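A minimal sketch of the two stages, with class and field names invented for illustration: the Unicode escape \u0061 (the letter a) is translated before tokenizing, so it can form part of an identifier, whereas the octal escape only has meaning inside the literal.

```java
final class EscapeStages {
    // Unicode escapes are translated before tokenizing, so one may appear inside
    // an identifier: the escape below is the letter 'a', making the name "ab".
    static int \u0061b = 42;

    // Octal and named escapes are interpreted inside the literal itself.
    static final String OCTAL_NEWLINE = "\012";
    static final String NAMED_NEWLINE = "\n";

    public static void main(String[] args) {
        System.out.println(ab);                                  // 42
        System.out.println(OCTAL_NEWLINE.equals(NAMED_NEWLINE)); // true
        System.out.println(OCTAL_NEWLINE.length());              // 1
    }
}
```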

So in conclusion: the Unicode escapes are NOT a replacement for the octal escapes. They are a completely different concept. In particular, to avoid any problems (as with \u000A above), one should use octal escapes for code points 0 to 255 and Unicode escapes for code points above 255.

Sven

The \0-\377 octal escapes are also inherited from C, and the restriction makes a fair amount of sense in a language like C where characters == bytes (at least in the halcyon days before wchar_t).

David Gelhar