
Converting a single character works like this:

final String str2 = "\u0026";
System.out.println(str2); // prints the & character

Now I want to print every character in a given range, e.g. [\u0621-\u0652], but I am not sure how to increment Unicode characters in a loop to print the individual characters in UTF-8.

user1298426
  • A String *is* UTF-8 characters. When you call `getBytes()`, you are almost certainly corrupting your character data, because getBytes() uses the system’s default charset, which on Windows is a one-byte charset like windows-125x. **Do not convert a String to bytes and then back to a String.** Just `String str2 = "\u0621";` is sufficient. – VGR Nov 04 '20 at 16:14
  • Your first line is pointless, because you construct a string from a byte array that you get from a string. You could just skip the two transformation steps and have the same result. Second, you assume that the platform default encoding is UTF-8 (which is true on Android and some Linux systems, but far from universal). – Joachim Sauer Nov 04 '20 at 16:14
  • updated the description. – user1298426 Nov 04 '20 at 16:20
  • Just literally create String objects using the "from unicode codepoint" constructor (see answer). – Mike 'Pomax' Kamermans Nov 04 '20 at 16:40
  • @VGR "*A String is UTF-8 characters*" - a Java `String` is actually a sequence of UTF-16 codeunits, not UTF-8. You can construct a `String` from a UTF-8 byte sequence, and convert a `String` to a UTF-8 byte sequence, but a `String` will never *contain* UTF-8. – Remy Lebeau Nov 05 '20 at 00:52
  • Note: you should check the block charts on the Unicode website. Single Unicode codepoints may not be relevant, or they may show an incomplete picture: there are variants (some characters can be shown in multiple ways), combining characters, ligatures, context (e.g. in Arabic the initial, medial, final, and isolated forms of a character have the same codepoint but different glyphs), language (fonts may have variants depending on the language), etc. – Giacomo Catenazzi Nov 05 '20 at 10:08

2 Answers

I can convert the single unicode character to utf-8 like this

No, you can't.

"\u0026".getBytes()

In Java, strings are Unicode. This puts the Unicode code point U+0026 inside your string. Then getBytes() turns that string into a byte array using the platform default encoding, which is ¯\(ツ)/¯ who knows what. On Windows it is probably Cp1252. On a Japanese computer it might be an encoding designed for Japanese text. If the platform default encoding can't represent a character, getBytes() does not throw: its behavior is unspecified, and in practice you typically get a '?' in place of the character. On most Linux variants the platform default IS UTF-8, and JDK 18 and later default to UTF-8 everywhere, but on older runtimes there is no guarantee whatsoever.

new String(thoseBytes, StandardCharsets.UTF_8)

If the platform default encoding is UTF-8, you've accomplished nothing whatsoever: you've taken a string, turned it into bytes via UTF-8, and then turned those bytes back into a string with UTF-8, thus guaranteeing you end up with the original. This is a silly, inefficient way to write: `final String str2 = "\u0026";`.

If the platform default is not UTF-8, then you've just done a gobbledygook transformation that means nothing. str2 contains garbage. Given that \u0026 means the same symbol in many encodings, especially encodings that tend to be platform defaults, most likely you get 'lucky' and str2 remains the string "\u0026". But there are no guarantees.

So, what you've done is convert nothing, or you've converted a string into garbage (the same way taking an image, saving it as a PNG, and then reading that PNG using a JPG decoder either crashes the decoder or produces meaningless garbage). Either one sounds rather useless.
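To make the two cases concrete, here is a minimal sketch that stands in for "the platform default" with an explicitly chosen charset. The class and method names (`CharsetRoundTrip`, `roundTrip`) are made up for illustration, and ISO-8859-1 is only an assumed example of a non-UTF-8 default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode with one charset, then decode as UTF-8. This mimics calling
    // getBytes() on a platform whose default charset is encodeWith and then
    // calling new String(bytes, StandardCharsets.UTF_8).
    static String roundTrip(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ampersand = "\u0026";
        String hamza = "\u0621";

        // '&' has the same single-byte encoding in ISO-8859-1 and UTF-8, so it survives.
        System.out.println(roundTrip(ampersand, StandardCharsets.ISO_8859_1).equals(ampersand)); // true

        // Encoding and decoding with the same charset is a no-op.
        System.out.println(roundTrip(hamza, StandardCharsets.UTF_8).equals(hamza)); // true

        // ISO-8859-1 cannot represent U+0621, so the character is lost.
        System.out.println(roundTrip(hamza, StandardCharsets.ISO_8859_1).equals(hamza)); // false
    }
}
```

The first case is the 'lucky' one described above; the last one is the garbage case.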

Try it:

System.out.println("\u0026");

Just run that. It will print the ampersand character, always, whereas your code merely does so on most platforms, but not all.

Now I want to print it for a given range for e.g. [\u0621-\u0652]

It's as simple as it sounds like.

char start = '\u0621';
char end = '\u0652';
for (int c = start; c <= end; c++) {
    // Cast back to char; printing the int itself would print the number 1569, 1570, ...
    System.out.println((char) c);
}

You seem to be confused about what UTF-8 and Unicode are.

Unicode is a giant table. It maps numbers, such as 38 (written \u0026 in hex notation: 0x26 is hex for 38), to a concept, generally a character, such as 'an ampersand'.

It does not describe anything more. In particular it does not say that the byte 38 means ampersand. It doesn't mention bytes at all; unicode has no idea what a byte is.
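As a small illustration of the "giant table" idea, the sketch below looks up the official Unicode name of a code point with `Character.getName` (available since Java 7). The class and helper names (`UnicodeTableLookup`, `describe`) are made up for this example.

```java
import java.util.Locale;

class UnicodeTableLookup {
    // Formats a code point as "U+XXXX (decimal N): NAME" using the JDK's
    // built-in copy of the Unicode character database.
    static String describe(int codePoint) {
        return String.format(Locale.ROOT, "U+%04X (decimal %d): %s",
                codePoint, codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0026)); // U+0026 (decimal 38): AMPERSAND
        System.out.println(describe(0x0621)); // U+0621 (decimal 1569): ARABIC LETTER HAMZA
    }
}
```

Note that nothing here involves bytes: the lookup goes from a number to a concept and nothing more.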

The obvious follow-up for a programmer is then: Okay, great, so if I have, say, "Hello & Goodbye!" as a string, Unicode tells me exactly which sequence of numbers describes each and every character inside it. But what do I then do with my 'bunch o numbers'? How shall I encode these in a file? A file is a bag of bytes, and given that Unicode defines a huge range while a byte can only represent 256 distinct values, you can't just go: "Well, store every number as a byte".

THAT is where UTF-8 comes in. UTF-8 isn't the same as Unicode. It is an encoding for storing numbers as bytes, specifically designed to efficiently store the kinds of numbers you are likely to get when converting strings to a series of numbers by mapping each character to its Unicode number.

Thus, '\u0621' is not UTF-8. It's the character, in Unicode, directly. That character encoded as UTF-8 would in fact be the two-byte sequence 0xD8 0xA1. That looks nothing like 0621.

Try it:

byte[] b = new byte[] { (byte) 0xD8, (byte) 0xA1 };
String s = new String(b, StandardCharsets.UTF_8);
System.out.println("The string: " + s);
System.out.println("The codepoint for that first char: " + (int) s.charAt(0));

That will print:

The string: ء
The codepoint for that first char: 1569

1569 is the decimal version of 0x0621.
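Going the other direction, from code point to bytes, shows how UTF-8 spends more bytes on larger numbers. This is a small sketch; the class and helper names (`Utf8Bytes`, `utf8Hex`) and the sample code points are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Returns the UTF-8 encoding of a single code point as space-separated hex bytes.
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x0026));  // 26
        System.out.println(utf8Hex(0x0621));  // D8 A1
        System.out.println(utf8Hex(0x20AC));  // E2 82 AC
        System.out.println(utf8Hex(0x1F600)); // F0 9F 98 80
    }
}
```

Only for code points below 0x80 does the byte value happen to equal the code point number.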

NB: As Mike pointed out in the comments, if you truly want to work with Unicode characters, they are called 'codepoints', and a char can't quite store all of them. You'd use codePointAt() and friends from the String class, but that's quite advanced, complicates the examples, and isn't important for answering the question.
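For readers who do want to see where char stops being enough, here is a short sketch using a code point above U+FFFF. The class and helper names (`SupplementaryCodePoints`, `fromCodePoint`) are illustrative; U+1F600 is just an example of an emoji code point.

```java
class SupplementaryCodePoints {
    // Builds a String holding exactly one code point, using as many chars as needed.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String face = fromCodePoint(0x1F600);

        // One code point, but two UTF-16 code units (a surrogate pair).
        System.out.println(face.length());                             // 2
        System.out.println(face.codePointCount(0, face.length()));     // 1
        System.out.println(Integer.toHexString(face.codePointAt(0)));  // 1f600

        // Casting to char silently drops the high bits and yields a different character.
        char truncated = (char) 0x1F600;
        System.out.println(Integer.toHexString(truncated));            // f600
    }
}
```

For the range in the question (U+0621 to U+0652) every code point fits in a single char, which is why the simple loop above works.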

rzwitserloot
  • Unicode codepoints haven't fit in a `char` since Unicode 3, please don't recommend people use chars for this, it just gives them code that seems to work, and then breaks when they go "hey it's 2020, let me do this for emoji" – Mike 'Pomax' Kamermans Nov 04 '20 at 16:28
  • The answer is already quite long, yet another sidetrack into higher plane chars is turning an answer into a lecture series, no? – rzwitserloot Nov 04 '20 at 16:28
  • it _is_ quite long, without any good reason: String [literally has a constructor](https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#String(int[],%20int,%20int)) for doing this, using `int`, not `char`, so this wall of text should really be "you can do this by using the right String constructor" instead of a wall of text. – Mike 'Pomax' Kamermans Nov 04 '20 at 16:29
  • I disagree that using that constructor would have made things easier. I have added a note at the end in the highly unlikely case it is appropriate for OP to worry about it at this phase of their programming career. – rzwitserloot Nov 04 '20 at 16:30
  • that is a very strange attitude to have: creating the Strings has been trivial since Java 1.5, as has the `getBytes` function for turning java strings into true UTF8. This answer is a giant wall of text that goes "no" when the answer is "yes, and it's easy, and has been for a decade+" – Mike 'Pomax' Kamermans Nov 04 '20 at 16:44
  • @Mike'Pomax'Kamermans: check how the question was originally phrased. This wall of text seemed very necessary to fix the misunderstandings that were implied by OP's phrasing of the question. I agree that this answer looks weird when posted next to the current version of the question. – Joachim Sauer Nov 05 '20 at 16:14
  • if I look at the way it was [originally posted](https://stackoverflow.com/posts/64683711/revisions), it was already pretty clear they were looking to form strings for sequential unicode codepoints. Their only mistake was thinking they needed utf8 conversion before printing, and that doesn't need a huge answer to explain. – Mike 'Pomax' Kamermans Nov 05 '20 at 16:28

You can quite easily do this using the String constructor that takes unicode codepoints as input:

public class Main {
  public static void main(String[] args) {

    // Unicode codepoints are plain numbers, conventionally written in hex notation:
    int start = 0x0621;
    int end = 0x0652;

    // The String(int[] codePoints, int offset, int count) constructor needs an
    // array of ints, even if we're only trying to build a single-letter String.
    int[] data = {0};

    for (int i = start; i <= end; i++) {
      data[0] = i;
      System.out.print(new String(data, 0, 1));
    }
  }
}

Which generates the output:

ءآأؤإئابةتثجحخدذرزسشصضطظعغػؼؽؾؿـفقكلمنهوىيًٌٍَُِّْ.

(which tries to perform Arabic text shaping because we're using print, not println, but that's not really related to the exercise of turning unicode codepoint numbers into actual strings)

Turning that Java-internal String data into an explicitly UTF-8 encoded byte sequence is then a trivial one-liner, explained over on "How to convert Strings to and from UTF8 byte arrays in Java".
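As a rough sketch of that last step, the following collects the whole range into one String with `StringBuilder.appendCodePoint` and then encodes it explicitly as UTF-8. The class and method names (`RangeToUtf8`, `rangeToString`) are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class RangeToUtf8 {
    // Builds a String containing every code point from start to end, inclusive.
    static String rangeToString(int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int cp = start; cp <= end; cp++) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String letters = rangeToString(0x0621, 0x0652);

        // The one-liner: always name the charset instead of relying on the platform default.
        byte[] utf8 = letters.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(letters.length());        // 50 characters
        System.out.println(utf8.length);             // 100 bytes, two per character
        System.out.println(decoded.equals(letters)); // true
    }
}
```

Passing the charset explicitly in both directions is what makes the round trip safe regardless of the platform default.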

Mike 'Pomax' Kamermans
  • `System.out.printf("%c", i)` is another approach, which might or might not be easier to understand. – VGR Nov 04 '20 at 18:15
  • While true, that ignores that the important part is not so much the printing itself, but ending up with a String that you can do further work with. – Mike 'Pomax' Kamermans Nov 05 '20 at 16:06