3

Hi, I have the following string:

Let\342\200\231s start with the most obvious question first. This is what an \342\200\234unfurl\342\200\235 is

It is supposed to be displayed as:

Let’s start with the most obvious question first. This is what an “unfurl” is

The first three numbers (\342\200\231) actually represent an octal sequence (http://graphemica.com/%E2%80%99), and its Unicode equivalent is \u2019.

Similarly, \342\200\234 represents an octal sequence (http://graphemica.com/%E2%80%9C), and its Unicode equivalent is \u201C.

Is there any library or function which I can use to convert these octal sequences to their unicode equivalent?

Vivek Kothari
  • Does your string contain the octal sequences written out literally, e.g., the actual characters _backslash_, _digit three_, _digit four_, _digit two_, _backslash_, _digit two_, _digit zero_, etc.? – Kevin Anderson May 30 '18 at 08:14
  • Yep, look at the sample string in the question, it's literally like that – Vivek Kothari May 30 '18 at 08:17
  • What is the source of the string? Is it being read from a text file, or is it written like that as a string literal in a Java source file? This would make a big difference to the answer. – DodgyCodeException May 30 '18 at 09:52
  • Note: Saying "unicode equivalent" isn't quite right. That's a Java source file UTF-16 escape. Both UTF-8 and UTF-16 are encodings for the Unicode character set. Your octal bytes use the UTF-8 encoding. – Tom Blodget May 30 '18 at 11:54

2 Answers

5

The bytes you show are (a representation of) UTF-8 encoding, which is only one of many forms of Unicode. Java is designed to handle such encodings as byte sequences (such as arrays, and also streams), but not as chars and Strings. The somewhat cleaner way is to actually use bytes, but then you have to deal with the fact that Java bytes are signed (-128 .. +127) and all multibyte UTF-8 codes are (by design) in the upper half of 8-bit space:

byte[] a = {'L','e','t',(byte)0342,(byte)0200,(byte)0231,'s'};
System.out.println (new String (a,StandardCharsets.UTF_8));
// or arguably uglier
byte[] b = {'L','e','t',0342-256,0200-256,0231-256,'s'};
System.out.println (new String (b,StandardCharsets.UTF_8));

But if you want something closer to your original, you can cheat just a little. A String literal written with octal escapes holds (unsigned) chars whose values are the UTF-8 byte values, that is, characters in the Unicode range 0000-00FF, which is defined to be the same as ISO-8859-1. Encoding that String as ISO-8859-1 therefore yields exactly the UTF-8 bytes, which can then be decoded as UTF-8:

byte[] c = "Let\342\200\231s".getBytes(StandardCharsets.ISO_8859_1);
System.out.println (new String (c,StandardCharsets.UTF_8));
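
The snippets above rely on the Java compiler interpreting the octal escapes. When the text arrives at run time with literal backslash characters followed by three octal digits (as the asker confirms in the comments), the escapes have to be parsed by hand first. What follows is only a minimal sketch under that assumption: `OctalEscapeDecoder` and `decodeOctalEscapes` are hypothetical names, not JDK or library methods, and only three-digit escapes from `\000` to `\377` are recognized.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class OctalEscapeDecoder {

    // Hypothetical helper: collects each literal "\ooo" escape as one raw byte,
    // then decodes the complete byte sequence as UTF-8.
    static String decodeOctalEscapes(String input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            if (input.charAt(i) == '\\' && i + 3 < input.length() && isOctalByte(input, i + 1)) {
                // Three octal digits in the range 000..377 give one byte value.
                out.write(Integer.parseInt(input.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                // Any other character is re-encoded as UTF-8 so it survives the final decode.
                int codePoint = input.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(codePoint);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // True if the three chars starting at 'start' form an octal number no larger than 377.
    static boolean isOctalByte(String s, int start) {
        char a = s.charAt(start), b = s.charAt(start + 1), c = s.charAt(start + 2);
        return a >= '0' && a <= '3' && b >= '0' && b <= '7' && c >= '0' && c <= '7';
    }

    public static void main(String[] args) {
        // In Java source, "\\342" is a literal backslash followed by the digits 3, 4, 2.
        String raw = "Let\\342\\200\\231s start. This is what an \\342\\200\\234unfurl\\342\\200\\235 is";
        System.out.println(decodeOctalEscapes(raw));
    }
}
```

Collecting all bytes before decoding matters here: a single character such as U+2019 is spread over three escapes, so decoding each escape on its own would not work.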
dave_thompson_085
  • `"Let\342\200\231s"` is loopy but could very well be what the question requires. Source file encoding → octal string escape for UTF-16 code units → reinterpreted as ISO 8859-1 bytes → converted back to UTF-16 as UTF-8 bytes. Good job. Production code would need lots of code comments. – Tom Blodget May 30 '18 at 12:03
  • @TomBlodget sir, did you read Vivek's comment below my answer? The input is actually without \ (escape) characters. – Soner from The Ottoman Empire May 30 '18 at 12:08
  • @snr \342 is a Java string escape for \u00E2. Your answer escapes the \ to make a backslash character. – Tom Blodget May 30 '18 at 12:11
  • But what's the use of code like this? If the string literal is in a Java source file, why not just change the source file to use `\uNNNN`? – DodgyCodeException May 30 '18 at 13:26
  • @DodgyCodeException Not only source code, [Google Protobuf prints strings in octal format](https://stackoverflow.com/q/62965214/839733) by default. You can change that, but it helps to know how to convert. – Abhijit Sarkar Jul 19 '20 at 03:38
  • @AbhijitSarkar: if you mean my 'c', as I said it contains the UTF-8 bytes as block 0 Unicode characters, and the 8859-1 encoding of those characters is the same numeric values and thus the desired UTF-8 encoding. If you mean 'a' or 'b', those seem so obvious I don't know how to explain them; what don't you understand? – dave_thompson_085 Jul 20 '20 at 06:16
  • @dave_thompson_085 I meant `c`. The part that I didn't understand is once you converted to `ISO_8859_1`, you essentially lost the 2nd byte of the Unicode characters, so when you convert back to `UTF-8`, why does it work? Or is it that when you convert back to UTF-8, two bytes are combined to create one UTF-8 output character? – Abhijit Sarkar Jul 20 '20 at 19:09
  • @AbhijitSarkar: all of the (UTF-16) characters in the String are in the range 0000-00FF (as I posted) so when encoded as 8859-1 nothing is lost; _that_ set of characters, and only that set, corresponds to the 256 values of one (8-bit) byte of 8859-1. The _three_ bytes in the resulting encoding (which were produced from three characters in the String) are then _decoded_ as UTF-8 to the single Unicode character U+2019. – dave_thompson_085 Jul 28 '20 at 00:23
  • Got it, thanks. I did say "two bytes are combined to create one UTF-8 output character", although it's actually three bytes, not two, had missed that part. – Abhijit Sarkar Jul 28 '20 at 00:36
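
The round trip discussed in these comments can be traced step by step. This is a small illustrative sketch, not part of the original answer; `Latin1RoundTrip` and `reinterpretAsUtf8` are hypothetical names introduced here.

```java
import java.nio.charset.StandardCharsets;

class Latin1RoundTrip {

    // Reinterprets a String whose chars all lie in U+0000..U+00FF as UTF-8 bytes.
    static String reinterpretAsUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1); // one byte per char, nothing lost
        return new String(bytes, StandardCharsets.UTF_8);       // three bytes become one char
    }

    public static void main(String[] args) {
        String s = "\342\200\231"; // octal escapes: three chars, not one
        for (char ch : s.toCharArray()) {
            System.out.printf("char U+%04X%n", (int) ch);   // U+00E2, U+0080, U+0099
        }
        for (byte b : s.getBytes(StandardCharsets.ISO_8859_1)) {
            System.out.printf("byte 0x%02X%n", b & 0xFF);   // 0xE2, 0x80, 0x99
        }
        String decoded = reinterpretAsUtf8(s);
        // prints "decoded length 1, U+2019"
        System.out.printf("decoded length %d, U+%04X%n", decoded.length(), (int) decoded.charAt(0));
    }
}
```

The trick only holds while every char in the String is at most U+00FF; a char outside that range cannot be encoded as ISO-8859-1 and would be replaced during `getBytes`.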
-1

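In Java, an octal escape cannot represent this character directly: octal escapes only cover \0 through \377 (U+0000 to U+00FF), so a character such as U+2019 needs a hexadecimal \u escape.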

This works fine:

System.out.println("\u2019");

It is probably for purely historical reasons that Java supports octal escape sequences at all. These escape sequences originated in C (or maybe in C's predecessors B and BCPL), in the days when computers like the PDP-7 ruled the Earth and much programming was done in assembly or directly in machine code. Octal was the preferred number base for writing instruction codes, and there was no Unicode, just ASCII, so three octal digits were sufficient to represent the entire character set.

By the time Unicode and Java came along, octal had pretty much given way to hexadecimal as the preferred number base when decimal just wouldn't do. So Java has its \u escape sequence that takes hexadecimal digits. The octal escape sequence was probably supported just to make C programmers comfortable, and to make it easy to copy'n'paste string constants from C programs into Java programs.

Guilherme Mussi
  • The decode function doesn't convert it to Unicode. Integer.decode("\342\200\234") throws an exception – Vivek Kothari May 30 '18 at 07:16
  • Let me know if it produces what you expect. – Guilherme Mussi May 30 '18 at 07:21
  • Nope.. it produces output of < â > it should produce < ’ > – Vivek Kothari May 30 '18 at 07:24
  • OK, I researched this, and it is actually not possible in Java with octals. It is only possible with hex. – Guilherme Mussi May 30 '18 at 07:31
  • If not Java, can we do it using Javascript? – Vivek Kothari May 30 '18 at 08:18
  • I think for that you need to create a different question. Maybe you can link to this one. If you could, please accept my answer. Thanks. – Guilherme Mussi May 30 '18 at 09:22
  • Just checked and Javascript has the same problem with Octals: the range is too low. – Guilherme Mussi May 30 '18 at 09:24
  • PDP-7's never ruled anything, they were too few (and ugly), but PDP-11's were much more numerous and longer-lasting _and_ were where C was born and raised (along with Unix), and pretty much all instructions except branches and traps used 3-bit and 6-bit fields, ideal for octal. The DEC-supplied debugger was Octal Debugging Tool (ODT), and some later 11 models had 'console ODT' in _firmware_. Unix from almost its earliest days had `od` (octal dump). – dave_thompson_085 Jul 20 '20 at 06:16