Output and preg to Unicode Strings in Java

Question

I have an ordinary String property inside an object, containing accented characters. When I debug the program in NetBeans, the Variables panel shows the string correctly.

But when I print the variable with System.out.println, I get strange output.

Every "à" becomes "a'" and so on, which leads to a wrong character count and also affects a Matcher run on the string.

How can I fix this? I need the accented characters in order to get the right character count and to use a Matcher on the string. I tried many approaches but none of them worked; I am surely missing something.

Thanks in advance.

EDIT

This is the code:

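// Imports assume PDFBox 2.x; in 1.8.x, PDFTextStripper and TextPosition live in org.apache.pdfbox.util.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class TextLine {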
    public List<TextPosition> textPositions = null;
    public String text = "";
}

public class myStripper extends PDFTextStripper {

    public ArrayList<TextLine> lines = null;

    boolean startOfLine = true;

    public myStripper() throws IOException
    {
    }

    private void newLine() {
        startOfLine = true;
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        newLine();
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        newLine();
        super.writeLineSeparator();
    }

    @Override
    public String getText(PDDocument doc) throws IOException
    {
        lines = new ArrayList<TextLine>();
        return super.getText(doc);
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        TextLine tmpline = lines.get(lines.size() - 1);
        tmpline.text += getWordSeparator();
        tmpline.textPositions.add(null);

        super.writeWordSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        TextLine tmpline = null;

        if (startOfLine) {
            tmpline = new TextLine();
            tmpline.text = text;
            tmpline.textPositions = textPositions;
            lines.add(tmpline);
        } else {
            tmpline = lines.get(lines.size() - 1);
            tmpline.text += text;
            tmpline.textPositions.addAll(textPositions);
        }

        if (startOfLine) {
            startOfLine = false;
        }

        super.writeString(text, textPositions);
    }
}

Isn't that just a display issue in your console? The string still should contain the right chars. — Wiktor Stribiżew, Sep 13 '17 at 09:05
Possible duplicate of [how do I set the default locale for my JVM?](https://stackoverflow.com/questions/8809098/how-do-i-set-the-default-locale-for-my-jvm) — Colonder, Sep 13 '17 at 09:05
No, because even string.length() gives the wrong count (it counts the extra accent) — Samuele Diella, Sep 13 '17 at 09:06
System.out will use the operating system encoding. By the way `à` could be one char or two chars: `a` + zero-width combining diacritical mark (accent). Output to a file with UTF-8 encoding. — Joop Eggen, Sep 13 '17 at 09:11
It is not only about System.out, it also affects String.length(). I've updated the post with the full view. The correct character count is 65, not 67 — Samuele Diella, Sep 13 '17 at 09:15
@SamueleDiella What's your default encoding? And how do you set the string property? Can you show us some code? — MC Emperor, Sep 13 '17 at 09:27
OK, just checked everything: `locale charmap` prints `UTF-8` and `echo $LANG` prints `it_IT.UTF-8`. I'm preparing the code — Samuele Diella, Sep 13 '17 at 10:21
I posted the code. @JoopEggen is there no way to disable this behaviour? — Samuele Diella, Sep 13 '17 at 11:02
@Colonder I don't think it is a duplicate. I tried setting the locales, but the characters are still wrong (and so is the character count). `System.out.printf("%s %s", Charset.defaultCharset(), Locale.getDefault());` -> output is `UTF-8 it_IT` — Samuele Diella, Sep 13 '17 at 11:10

MC Emperor · Accepted Answer · 2017-09-13T12:32:22.047

It is about the representation of certain Unicode characters.

What is a character? That question is hard to answer. Is à one character, or two (the a and the ` on top of each other)? It depends on what you consider to be a character.

The grave accents (`) you are seeing are actually combining diacritical marks. Combining diacritical marks are separate Unicode characters, but many text processors combine them with the preceding character. For instance, java.text.Normalizer.normalize(str, Normalizer.Form.NFC) does this for you: it composes a base letter and its combining mark into a single precomposed character where one exists.
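A minimal sketch of that normalization step, using only the JDK. The class name NfcDemo, the helper compose and the sample text are made up for illustration; only Normalizer.normalize and Normalizer.Form.NFC come from the answer.

```java
import java.text.Normalizer;

final class NfcDemo {
    // Composes base letter + combining mark pairs into precomposed characters where possible.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'a' followed by U+0300 COMBINING GRAVE ACCENT (decomposed form of à)
        String decomposed = "da\u0300 la vita";
        String composed = compose(decomposed);

        System.out.println(decomposed.length()); // 11
        System.out.println(composed.length());   // 10
        System.out.println(composed.equals("d\u00E0 la vita")); // true
    }
}
```

In the stripper from the question, the same call could be applied to each TextLine.text after extraction, keeping in mind that the normalized string then has fewer chars than the decomposed one.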

The library you are using (Apache PDFBox) possibly normalizes the text, so that each diacritic is merged into the TextPosition of the preceding character. As a result, some TextPosition instances in your text contain two code points (here, e and a, each followed by a combining grave accent). That is why the list of TextPosition instances has length 65.

However, your String, which is in fact a CharSequence, holds 67 chars, because each of the two diacritics takes up one char of its own.
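The 65 versus 67 difference can be reproduced without PDFBox. The following is only a sketch: the class CharCountDemo and its helpers are hypothetical, and the sample string is rebuilt by hand on the assumption that the two accents are stored as U+0300 combining marks.

```java
final class CharCountDemo {
    // True for Unicode combining marks (general categories Mn, Mc, Me).
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Counts code points, skipping combining marks so that "e + accent" counts once.
    static int countWithoutMarks(String s) {
        return (int) s.codePoints().filter(cp -> !isCombiningMark(cp)).count();
    }

    public static void main(String[] args) {
        String line = "dere che Geova e\u0300 il Creatore e Colui che da\u0300 la vita. Probabilmen-";

        System.out.println(line.length());           // 67: every combining mark is its own char
        System.out.println(countWithoutMarks(line)); // 65: accents folded into their base letters
    }
}
```

Skipping marks is a simplification of "user-perceived characters"; it is enough for accented Latin text like this but is not a full grapheme cluster segmentation.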

System.out.println() just prints each character of the string, and that is rendered as "dere che Geova e` il Creatore e Colui che da` la vita. Probabilmen-".

Then why is the NetBeans debugger showing "dere che Geova è il Creatore e Colui che dà la vita. Probabilmen-" as the value of the string?

That is simply because the Netbeans debugger displays the normalized text for you.
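Because a debugger or console may render the same string differently, one way to see what it really contains is to print its code points instead of its glyphs. This is a small sketch; CodePointDump and dump are hypothetical names, not part of any library.

```java
import java.util.stream.Collectors;

final class CodePointDump {
    // Renders every code point as U+XXXX, independent of fonts and terminal encoding.
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(dump("e\u0300")); // U+0065 U+0300 (decomposed è)
        System.out.println(dump("\u00E8"));  // U+00E8 (precomposed è)
    }
}
```

If the dump of the extracted text shows U+0300 after a base letter, the string is in decomposed form, whatever the Variables panel displays.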

The java.**text**.Normalizer can be used for both composing and decomposing accented letters, which can be either a single combined letter or a basic Latin letter plus zero-width accents. Probably the string was in decomposed form, and System.out substituted the zero-width accents with real accents or the like. Hence this is probably the solution. — Joop Eggen, Sep 13 '17 at 12:29
very nice, thanks a lot, this is the solution!! You saved my journey ;-) Thanks to everyone who helped, a lot! — Samuele Diella, Sep 13 '17 at 14:00
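Since the question also mentions Matcher, here is a hedged sketch of how the decomposed form can affect a regular expression and how normalizing first changes the result. The pattern, the class MatcherDemo and the sample text are assumptions chosen for illustration, not the asker's actual code.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class MatcherDemo {
    // 'd', then exactly one character, then whitespace.
    private static final Pattern D_WORD = Pattern.compile("d.\\s");

    static boolean findsDWord(String s) {
        return D_WORD.matcher(s).find();
    }

    public static void main(String[] args) {
        // Decomposed: 'a' followed by U+0300, so two code points sit between 'd' and the space.
        String decomposed = "che da\u0300 la vita";
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(findsDWord(decomposed)); // false
        System.out.println(findsDWord(composed));   // true
    }
}
```

The dot matches a single code point, so the combining accent counts as an extra character until the text has been composed.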
