I can convert the single unicode character to utf-8 like this
No, you can't.
"\u0026".getBytes()
In Java, strings are Unicode. This puts the Unicode code point 0026 inside your string. Then, getBytes() turns that string into a byte array by way of the platform default encoding, which is ¯\(ツ)/¯ who knows what it is. On Windows it's probably Cp1252. On a Japanese computer it might be some Shift_JIS variant. If the platform default encoding can't encode a character, the javadoc leaves the behavior unspecified; in practice you silently get a replacement byte (usually '?') instead of your character. On most Linux variants the platform default IS UTF-8, and since Java 18 the JDK defaults to UTF-8 unless configured otherwise, but on older versions there is no guarantee whatsoever.
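To take the platform default out of the picture, name the charset explicitly. Here is a minimal sketch of that; the class name ExplicitCharsetDemo and the helpers encode and hex are made up for illustration, while getBytes(Charset), Charset.defaultCharset() and StandardCharsets are standard JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Hypothetical helper: encodes with an explicitly named charset,
    // so the result is the same on every machine.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // Hypothetical helper: formats bytes as space-separated two-digit hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Varies per machine and JDK version, which is exactly the problem.
        System.out.println("Platform default: " + Charset.defaultCharset());

        System.out.println(hex(encode("\u0026", StandardCharsets.UTF_8)));   // 26
        System.out.println(hex(encode("\u0621", StandardCharsets.UTF_8)));   // D8 A1
        // US-ASCII cannot represent this character; getBytes(Charset)
        // substitutes the replacement byte '?' (0x3F) without complaint.
        System.out.println(hex(encode("\u0621", StandardCharsets.US_ASCII))); // 3F
    }
}
```

The last line is the dangerous case: no exception, no warning, just a question mark where your character used to be.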
new String(thoseBytes, StandardCharsets.UTF_8)
If the platform default encoding is UTF-8, you've accomplished nothing whatsoever: you've taken a string, turned it into bytes via UTF-8, and then turned those bytes back into a string with UTF-8, thus guaranteeing you end up with the original. This is a silly, inefficient way to write: final String str2 = "\u0026";
If the platform default is not UTF-8, then you've just done a gobbledygook transformation that means nothing: str2 may contain garbage. Given that \u0026 is encoded as the same byte in many encodings, especially the ones that tend to be platform defaults, most likely you get 'lucky' and str2 remains the string "\u0026". But there are no guarantees.
So, what you've done is convert nothing, or you've converted a string into garbage (the same way taking an image, saving it as a PNG, and then reading that PNG using a JPG decoder either crashes the decoder or produces meaningless garbage). Either one sounds rather useless.
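Both outcomes can be reproduced on any machine by standing in for the platform default with an explicit charset. The sketch below assumes ISO-8859-1 plays the role of a non-UTF-8 default; the class name CharsetMismatchDemo and the helper roundTrip are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Hypothetical helper: encodes with one charset and decodes with another,
    // which is what the question's code does whenever the default is not UTF-8.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // The 'lucky' case: both charsets encode the ampersand as the byte 0x26.
        String ampersand = roundTrip("\u0026", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(ampersand.equals("\u0026")); // true

        // The garbage case: e-acute is the single byte 0xE9 in ISO-8859-1,
        // which is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD.
        String eAcute = roundTrip("\u00E9", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(eAcute.equals("\u00E9")); // false
        System.out.println(eAcute.equals("\uFFFD")); // true
    }
}
```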
Try it:
System.out.println("\u0026");
Just run that. It will print the ampersand character, always, whereas your code merely does so on most platforms, but not all.
Now I want to print it for a given range for e.g. [\u0621-\u0652]
It's as simple as it sounds.
char start = '\u0621';
char end = '\u0652';
for (int c = start; c <= end; c++) {
    // Cast back to char: printing the int itself would print numbers (1569, 1570, ...).
    System.out.println((char) c);
}
You seem to be confused about what UTF-8 and Unicode are.
Unicode is a giant table. It maps numbers, such as 38 (\u0026 is hex notation: 0x26 is 38 in decimal), to a concept, generally a character, such as 'an ampersand'.
It does not describe anything more. In particular, it does not say that the byte 38 means ampersand. It doesn't mention bytes at all; Unicode has no idea what a byte is.
The obvious follow-up for a programmer is then: Okay, great, so if I have, say, "Hello & Goodbye!" as a string, Unicode tells me exactly which sequence of numbers describes each and every character inside it. But what do I then do with my 'bunch o numbers'? How shall I encode these in a file? Files are bags of bytes. Given that Unicode defines a huge range of numbers, and a byte can only represent 256 distinct values, you can't just go: "Well, store every number as a byte".
THAT is where UTF-8 comes in. UTF-8 isn't the same as Unicode. It is an encoding for storing numbers as bytes, specifically designed to efficiently store the kinds of numbers you are likely to get when converting strings to a series of numbers by mapping each character to its Unicode number.
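The split between 'numbers' and 'bytes' can be seen directly in Java. This is only a sketch: the class name NumbersVersusBytes and the helper codePoints are made up here, and UTF-16BE is picked arbitrarily as a second encoding to compare against.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NumbersVersusBytes {
    // Unicode's job: map each character of the string to a number (a code point).
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String text = "Hello & Goodbye!";

        // The 'bunch o numbers'; the ampersand shows up as 38.
        System.out.println(Arrays.toString(codePoints(text)));

        // An encoding's job: turn those numbers into bytes.
        // Same numbers, different encodings, different byte counts.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 16
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 32
    }
}
```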
Thus, '\u0621' is not UTF-8. It's the character, in Unicode, directly. That character encoded as UTF-8 would in fact be the two-byte sequence 0xD8 0xA1. That looks nothing like 0621.
Try it:
byte[] b = new byte[] { (byte) 0xD8, (byte) 0xA1 };
String s = new String(b, StandardCharsets.UTF_8);
System.out.println("The string: " + s);
System.out.println("The codepoint for that first char: " + (int) s.charAt(0));
That will print:
The string: ء
The codepoint for that first char: 1569
1569 is the decimal version of 0x0621.
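Going the other way shows where 0xD8 0xA1 comes from. The sketch below hand-rolls UTF-8 for the two-byte range only (U+0080 through U+07FF) and compares the result with what the JDK produces; the class name Utf8TwoByteDemo and the method encodeTwoByte are hypothetical, and real code should simply call getBytes(StandardCharsets.UTF_8).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8TwoByteDemo {
    // Hand-rolled UTF-8 for code points in the two-byte range U+0080..U+07FF only.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not in the two-byte UTF-8 range");
        }
        byte lead = (byte) (0xC0 | (codePoint >> 6));           // 110xxxxx: top 5 bits
        byte continuation = (byte) (0x80 | (codePoint & 0x3F)); // 10xxxxxx: low 6 bits
        return new byte[] { lead, continuation };
    }

    public static void main(String[] args) {
        byte[] byHand = encodeTwoByte(0x0621);
        byte[] byLibrary = "\u0621".getBytes(StandardCharsets.UTF_8);

        System.out.printf("%02X %02X%n", byHand[0] & 0xFF, byHand[1] & 0xFF); // D8 A1
        System.out.println(Arrays.equals(byHand, byLibrary));                 // true
    }
}
```

The 11 bits of 0x0621 are split 5 and 6 across the two bytes, with marker bits in front of each, which is why the result doesn't resemble 0621 at all.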
NB: As Mike pointed out in the comments, if you truly want to work with Unicode characters, they are called 'code points', and char can't quite store them: a char is 16 bits, and code points above U+FFFF need two of them. You'd use .codePointAt() and friends from the String class, but that's quite advanced, complicates the examples, and isn't important for answering the question.
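For the curious, here is a small sketch of what that looks like, using U+1F600 (an emoji outside the 16-bit range) as an arbitrary example. The class name CodePointDemo and the helper firstCodePoint are made up; codePointAt, codePointCount, codePoints and Character.toChars are standard JDK methods.

```java
public class CodePointDemo {
    // Hypothetical helper: returns the first full code point of a string,
    // even when it takes two chars to store.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Build a string holding the single code point U+1F600.
        String smiley = new String(Character.toChars(0x1F600));

        System.out.println(smiley.length());                              // 2 (chars)
        System.out.println(smiley.codePointCount(0, smiley.length()));    // 1 (code point)
        System.out.println(Integer.toHexString(firstCodePoint(smiley)));  // 1f600

        // Iterate by code point instead of by char.
        ("a" + smiley).codePoints()
                .forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```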