
I am working on a project that deals with foreign languages. I have this String in Java:

String string = "áçñéüöëéóíóíóíóíííããéíáéíáççãÓłńńāņšøøøøééèèÜüÜüééíéáéáříříççááññïïššäääééèèááéáéáéáéáéáéáèèèèííéèéèáééÇÇééééííüüüüííøøáááá¿¿ííóé̌Íá̌íáææööíÁíÁíííłççññá璇üşİüşİöğöğşşııããßßôèôèêééççáÁáÁééééééÇóóéíêööééííððññáñáñÓúÓúíłńłńååéééëëááéí¿¿ééÖÖáéáéöğÖüöğÖüçŞçŞııçııçııİİşİşíáíáéüüÉÉéééøññïíéé";

and I have saved my Java file in UTF-8 encoding.

I want to remove duplicate characters, then sort the characters by their Unicode code points, and finally print the resulting string and save it to a text file (in UTF-8 or another Unicode encoding).

I don't know if it is because of the terminal - I am working in Eclipse (Windows) and I see '?' (question mark) when printing some of the characters. What is the correct way to print the string?

I am also not sure how to SAFELY remove duplicated characters and sort the characters. For example, if I use String.charAt() and HashSet<Character>, is it safe to do so in my case? Will I get half a character for some multi-byte character? What is a safe way to compare these characters?

Knowing that the project may deal with a very large variety of different languages, what is a safe way to save the string into text file?


Update: To reproduce the question mark problem:

String str = "¿æŁéİüłïąņąø";
System.out.println(str);

It prints out this on my Eclipse console:

¿æ?é?ü?ï???ø

Note: I am already using GNU FreeMono for the console font, which has very good foreign character cover.

user2526586
  • `String.charAt` and `Character` operate on UTF-16 code units, so as long as you read the file correctly, and there are no characters outside the BMP, you won't get "half a character". I'm surprised I can't find a duplicate... Maybe I'm not using the right words... – Sweeper Sep 01 '22 at 12:12
  • Note that the notion of "character" might not be as simple as you think (often a character is represented by one codepoint, but that's not always the case). And neither is the notion of "duplicate" (some characters can be represented in multiple ways that should be considered equivalent). – Joachim Sauer Sep 01 '22 at 12:17
  • I'd suggest that you provide a [mcve] for that "question marks" problem, if that's what you are actually asking about. Otherwise, you should remove that part to make your question more focused. – Sweeper Sep 01 '22 at 12:20
  • Given no 'complications' such as come from the increasingly imprecise nature of what defines a character, which has been discussed by others like @Joachim Sauer here, you can de-duplicate and order like this: ```TreeSet<Character> orderedChars = string.chars().mapToObj(c -> (char) c).collect(Collectors.toCollection(TreeSet::new));``` As far as printing is concerned, besides having the encoding sorted out, you also need a font that has the glyphs for representing the characters. – g00se Sep 01 '22 at 15:13
  • You are asking multiple, unrelated questions, which means it may be closed (_"This question currently includes multiple questions in one. It should focus on one problem only."_). Rendering to the console has nothing to do with sorting characters, nor removing duplicate characters. Please amend your question so that it focuses on a single issue, and create additional new question(s) if appropriate. – skomisa Sep 03 '22 at 02:32

2 Answers


Characters in running Java programs are intrinsically Unicode. They are in fact stored as UTF-16, which you can ignore until you're interested in code points U+10000 or greater, which you're probably not at this point - but if you are, look at the 'codepoint' operations such as String.codePoints() and String.codePointAt().
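As a rough illustration of those 'codepoint' operations applied to the question's task, here is a minimal sketch that removes duplicates and sorts by code point value. The class and method names (`CodePointDedup`, `distinctSorted`) are made up for this example, and the literals use `\u` escapes so the sketch does not depend on the source-file encoding. It treats each code point as one "character" and does not handle combining marks.

```java
final class CodePointDedup {

    // Returns the distinct code points of the input, ordered by numeric code point value.
    // Working on code points (not char values) keeps surrogate pairs intact.
    static String distinctSorted(String input) {
        return input.codePoints()
                .distinct()
                .sorted()
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // U+00E9, U+00E1, U+00E9, U+00E7: one duplicate, unsorted.
        String result = distinctSorted("\u00E9\u00E1\u00E9\u00E7");
        // Expect U+00E1, U+00E7, U+00E9 in that order.
        System.out.println(result.equals("\u00E1\u00E7\u00E9"));
    }
}
```

If all input is known to be inside the BMP, the `char`-based approach from the comments gives the same result; the code point version only matters once supplementary characters show up.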

A String is thus automatically a string of Unicode characters.

Java source code is commonly saved as UTF-8, but the compiler decodes it using the platform default encoding unless you pass javac's -encoding option (from JDK 18 onward that default is UTF-8). I'm a "UTF-8 only" person, so I rarely run into this.

So what this boils down to is that you don't have to do anything special in a Java program to "use Unicode" - it just is.

You may need to pay attention to cases where you read and write Strings to some external medium, like a disk file or a network connection. There is a conversion to a byte stream, which uses the platform default charset unless you say otherwise - UTF-8 on JDK 18 or later and typically on Linux, but often a legacy code page such as windows-1252 on Windows with older JDKs. You can explicitly specify the byte encoding in most contexts.
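For the "save the string into a text file" part of the question, a minimal sketch of specifying the encoding explicitly might look like the following. The class name `Utf8FileRoundTrip`, its methods, and the temporary file are assumptions made for illustration; the only point is that `StandardCharsets.UTF_8` is named on both the write and the read side, so the platform default never comes into play.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileRoundTrip {

    // Encodes the text as UTF-8 bytes and writes them to the file.
    static void write(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads all bytes and decodes them as UTF-8.
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Writes the text to a temporary file, reads it back, and deletes the file.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("unicode-demo", ".txt");
            try {
                write(file, text);
                return read(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The sample string from the question, written with \u escapes.
        String sample = "\u00BF\u00E6\u0141\u00E9\u0130\u00FC\u0142\u00EF\u0105\u0146\u0105\u00F8";
        System.out.println(roundTrip(sample).equals(sample));
    }
}
```

UTF-8 can encode every Unicode code point, so this round trip should be lossless for any language the project meets, provided the reader of the file also decodes it as UTF-8.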

Your remaining problem seems to be related to display on Windows. That appears to be a font issue; you need a font containing the characters. Or, since it's Windows, it may be a matter of selecting the right "code page".

  • While Java goes pretty far in making Unicode "just work", it doesn't lift the burden of having an understanding of what a "character" is (and more important: is not), which can be really hard to define in Unicode. Characters can be represented pre-composed or using combining marks, for example. – Joachim Sauer Sep 01 '22 at 12:18
  • True, and perhaps this case ("dealing with foreign languages") is going to have to deal with such nuances. – access violation Sep 01 '22 at 12:26
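The point raised in these comments can be illustrated with `java.text.Normalizer` from the standard library. This is only a sketch, with a made-up class name (`NormalizeBeforeDedup`) and helper (`toNfc`): it shows that a pre-composed character and its combining-mark form compare as different strings until both are normalized, which matters if "duplicate" is supposed to include such equivalents. Whether NFC is the right form for the project is an assumption here.

```java
import java.text.Normalizer;

final class NormalizeBeforeDedup {

    // Converts the input to Unicode Normalization Form C (canonical composition).
    static String toNfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // e with acute as a single code point
        String decomposed = "e\u0301";  // plain e followed by COMBINING ACUTE ACCENT

        // Different code point sequences, so not equal as raw strings.
        System.out.println(precomposed.equals(decomposed));
        // Equal once both sides are normalized to the same form.
        System.out.println(toNfc(precomposed).equals(toNfc(decomposed)));
    }
}
```

Normalizing the whole input once, before any de-duplication or sorting, keeps the later steps simple.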

When calling System.out.println(str), the charset used by the underlying PrintStream (i.e. System.out) is your default charset, and if that is not UTF-8 then you might have problems when rendering in the Eclipse console. From the javadoc for PrintStream, with my emphasis added:

All characters printed by a PrintStream are converted into bytes using the given encoding or charset, or platform's default character encoding if not specified.

So your console output is probably not working because your "platform's default character encoding" is not UTF-8. There are two simple coding approaches to resolve that:

  • Call java.lang.System.setOut() so that System.out uses UTF-8.
  • Create your own PrintStream that uses UTF-8 instead of using System.out.

Here's code which reproduces your problem, and resolves it:

package pkg;

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Main {

    public static void main(String[] args) throws IOException {
        
        String str = "¿æŁéİüłïąņąø"; // Sample data from the question.
        
        System.out.println("1: " + str); // Fails if default charset is not UTF-8.  

        // Redirect System.out to use a PrintStream using UTF-8 charset.
        FileOutputStream fos2 = new FileOutputStream(FileDescriptor.out);
        PrintStream ps2 = new PrintStream(fos2, true, StandardCharsets.UTF_8);
        System.setOut(ps2);
        System.out.println("2: " + str); // Works.
        
        // Use your own PrintStream with UTF-8 charset instead of using System.out.
        FileOutputStream fos3 = new FileOutputStream(FileDescriptor.out);
        PrintStream ps3 = new PrintStream(fos3, true, StandardCharsets.UTF_8);
        ps3.print("3: " + str); // Works.
        ps3.close();
    }
}

This is a screen shot of the Eclipse console output from running that code, which demonstrates that both solutions described above work:

[Image: Eclipse console]

Notes:

  • My environment was Eclipse 2022-06 (4.24.0) with JDK 11.0.12 on Windows 10, with windows-1252 as the default charset and Consolas as the console font.
  • Presumably some (but not all) of the characters in your sample data rendered correctly because your "default charset" supported some (but not all) of those characters. None of the characters rendered correctly when using my default charset (windows-1252).
  • An alternative approach would be to change your platform's default encoding to UTF-8, so that System.out.println(str) would automatically encode using UTF-8, but that would mean your code is not portable.
  • The question "Java JDK 18 in IntelliJ prints question mark "?" when I tried to print unicode like "\u1699"" is relevant, though it focuses on println() issues with JDK 17/18 on IntelliJ IDEA.
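To see where the question marks come from without involving a console at all, here is a small sketch that encodes a string with a named charset and decodes it again. The class name `UnmappableDemo` and the helper `roundTrip` are invented for this example, and it assumes the JDK in use ships the windows-1252 charset (mainstream JDKs do, but it is not one of the charsets every Java platform is required to support). `String.getBytes(Charset)` substitutes the charset's replacement byte, `?` for windows-1252, for every character it cannot map.

```java
import java.nio.charset.Charset;

final class UnmappableDemo {

    // Encodes the text with the named charset and decodes it again.
    // Characters the charset cannot represent come back as its replacement, typically '?'.
    static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The charset System.out would use if none is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // The sample string from the question, written with \u escapes.
        String sample = "\u00BF\u00E6\u0141\u00E9\u0130\u00FC\u0142\u00EF\u0105\u0146\u0105\u00F8";
        String lossy = roundTrip(sample, "windows-1252");

        // Report which positions were replaced, using ASCII only.
        for (int i = 0; i < sample.length(); i++) {
            boolean lost = sample.charAt(i) != lossy.charAt(i);
            System.out.printf("U+%04X %s%n", (int) sample.charAt(i), lost ? "-> ?" : "kept");
        }
    }
}
```

With windows-1252 the replaced positions line up with the question marks in the output shown in the question, which suggests the asker's default charset was windows-1252 or something close to it.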
skomisa