-2

I'm having some problems with the encoding of my output. This is one of the cases:

"<" + this.strName + ">" + strData + "</" + this.strName + ">"
return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(returnFullTagData(strData).getBytes())).getDocumentElement();

In NetBeans debug mode it works correctly, but when I run the build version it throws Invalid byte 2 of a 3-byte UTF-8 sequence.

I solved that problem with:

new String( ("<" + this.strName + ">" + strData + "</" + this.strName + ">").getBytes(), "UTF-8");

But I need this to always work like the first version. Why? Because of this:

When I try to save the new XML file, it is saved correctly in NetBeans debug mode:

<kind schema="">Fonología</kind>

But the build version has the same encoding problem:

<kind schema="">Fonolog?a</kind>

I think these two problems are directly related, but I don't know how.

Of course, I tried to fix this by changing the encoding of the input data in my XML, as in the first case, but it doesn't work.

EDIT

OK, I'm now using some of your answers and I'm getting something very interesting.

The first case was changed to:

strData = "<" + this.strName + ">" + strData2 + "</" + this.strName + ">";
return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(returnFullTagData(strData))))
                .getDocumentElement();

And it's working nicely, no more ??? (and UnsupportedEncodingException is not needed anymore, love it).

The second change is the way it reads the XML base file:

DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

        FileInputStream in = new FileInputStream(new File(strBase));
        doc = dBuilder.parse(in, "UTF-8");

But now I have another problem:

<li>Artículo Definido</li>

instead of

<li>Artí­culo Definido</li>

And it's kind of tricky, because I'm using two types of nodes in this document: the "string-based" nodes are printed correctly, but the "file-based" nodes have that problem...

The libraries I'm using are POI, Guava, XMLBeans (included with POI) and dom4j.

PS: Again, it only happens in the build version. Why does it happen? I'm really tired of trying to debug it, and debugging is basically useless here.

GunBlade
  • There is enough discussion of this in the wild (Google). Just search for `Invalid byte 2 of a 3-byte UTF-8 sequence` – Jay Harris Jul 16 '15 at 15:58
  • http://stackoverflow.com/questions/2421272/invalid-byte-2-of-2-byte-utf-8-sequence – Jay Harris Jul 16 '15 at 15:59
  • http://stackoverflow.com/questions/11320108/what-does-the-message-invalid-byte-2-of-a-3-byte-utf-8-sequence-mean – Jay Harris Jul 16 '15 at 15:59
  • The problem is not the exception... cuz i'm not receiving this exception anymore. the problem is the second case, when i'm saving/printing the new file it saves ??? instead of áéí – GunBlade Jul 16 '15 at 16:00
  • The ? means that the character is not recognized in the current encoding format – Jay Harris Jul 16 '15 at 16:01
  • Yes, i know, but as i said, the debug version is working correctly, but the Build version is not... Probably have something wrong in the config but i dont know what it is – GunBlade Jul 16 '15 at 16:03
  • Is `in` an `InputStream`? Are you calling [this method](http://docs.oracle.com/javase/8/docs/api/javax/xml/parsers/DocumentBuilder.html#parse-java.io.InputStream-java.lang.String-)? The second parameter is *not* an encoding! But note that if the parser doesn’t guess the encoding from an `InputStream` of a file correctly, you have to check whether it has been correctly *written* to the file in the first place. Note that you can simply use [`parse(File)`](http://docs.oracle.com/javase/8/docs/api/javax/xml/parsers/DocumentBuilder.html#parse-java.io.File-). – Holger Jul 16 '15 at 18:50
  • I edited my question to answer you. Yes, it's an InputStream. I thought I needed a DocumentBuilder set to UTF-8, so I based my new version on this: http://stackoverflow.com/questions/16400136/why-my-dom-parser-cant-read-utf-8. The option you gave me was my first version too: http://stackoverflow.com/questions/31442021/invalid-byte-2-of-a-3-byte-utf-8-sequence-when-i-execute-the-build-project – GunBlade Jul 16 '15 at 18:56
  • So it proves that not all existing answers provide useful information, even if accepted. `UTF-8` is the default encoding assumed for all XML files, unless they contain a declaration specifying a different encoding. So the file might have a wrong declaration in it or it is not properly UTF-8 encoded. The question is *how was it created*? – Holger Jul 16 '15 at 19:06
  • So, what can i do?, I'm using Notepad++ to reformat my XML base file, It can be found here https://raw.githubusercontent.com/GunB/e-Parser/develop/metadata.xml – GunBlade Jul 16 '15 at 19:08
  • I don’t see any special characters in that file. – Holger Jul 16 '15 at 19:10
  • It hasn't, it's a base XML file, an empty one. The data is in a Microsoft Excel file here https://raw.githubusercontent.com/GunB/e-Parser/develop/prueba/esn1le01ob01meta13_07_15.xlsx and it is read by the method public static HashMap turnSheetToObject(XSSFSheet xssSheet) on https://github.com/GunB/e-Parser/blob/develop/src/utiility/ExcelReader.java – GunBlade Jul 16 '15 at 19:17
  • If I understand correctly, you have a process consisting of multiple XML parsing and writing operations and the attempt to fix one problem has introduced even more errors. So it’s important to remove the subsequent errors and identify the place where the initial error happens. So you have to document the steps of the process in your question and the place where the initial error happened. Try to gather as much information as possible, I’ll come back to this question tomorrow… – Holger Jul 16 '15 at 19:20
  • My real question is, Why the debug version works correctly and the Build version doesn't. I feel like i'm "hacking" my own code, and it's really annoying. The Excel file is read correctly because the console that I used to see the data. con = new PrintStream(new utiility.TextAreaOutputStream(this.txtConsole, 400), true, "UTF-8"); shows it. I'm not getting errors from fixing errors, I'm getting errors from a code that works correctly in debug mode and not on release mode – GunBlade Jul 16 '15 at 19:25
  • 1
    As explained several times, *you are mixing up the platform’s default encoding and `UTF-8`*. When you run inside NetBeans, it declares the default encoding as being `UTF-8`, despite the operating system having a different encoding. So your mixing has no consequences in debug mode, as both encodings happen to be the same. When you run your code in production mode, it will use the real platform’s default encoding, which is *not* `UTF-8`. Hence, you are mixing two different encodings then. Stop mixing these two, stop performing obsolete conversions and the problem will disappear. I feel a déjà vu… – Holger Jul 17 '15 at 08:26
  • Just check the output of `System.out.println(Charset.defaultCharset()==StandardCharsets.UTF_8);` inside the debugger and in production mode. Hope you will get enlightenment. – Holger Jul 17 '15 at 08:32
  • I already answered my question: the problem was the Guava library. When I deleted it, all the code worked correctly in debug and release mode. Thanks for your help – GunBlade Jul 17 '15 at 12:27
  • 1
    The problem is still a mixing of the different character encodings, even when it happens inside the Guava library… – Holger Jul 17 '15 at 13:44

3 Answers

4

That í is replaced by ? means that there was a conversion from Unicode (java text, String) to bytes using an encoding for those bytes that could not map the letter.

Use String.getBytes(StandardCharsets.UTF_8). (Unless there is a <?xml ...> encoding which differs from UTF-8.)

Avoid s = new String(s.getBytes(), "UTF-8"); which is a kind of hack, work-around, and still has some pitfalls.
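As a minimal sketch of why the ? appears: the class and method names below are hypothetical, and US-ASCII only stands in for a platform default encoding that cannot represent í (the real default on the build machine may be a different charset). Encoding with such a charset silently substitutes ?, while an explicit UTF-8 round trip keeps the text intact.

```java
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Encode and decode with a charset that cannot map 'í'.
    // US-ASCII is an assumption standing in for a non-UTF-8 platform default.
    public static String viaAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Encode and decode with an explicitly named charset on both sides.
    public static String viaUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00ed is 'í'; the escape keeps this source file encoding-independent.
        String text = "Fonolog\u00eda";
        System.out.println(viaAscii(text));              // Fonolog?a
        System.out.println(viaUtf8(text).equals(text));  // true
    }
}
```

The point is not that UTF-8 is special, only that both directions of the conversion name the same charset and that this charset can represent every character involved.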

For good order:

  • NetBeans IDE, Project Properties / Encoding: UTF-8
  • maven pom.xml: <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

After a short evaluation of the project:

Nothing suspicious found, try:

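import java.io.IOException;
import java.io.OutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public static void printDocument(Document doc, OutputStream out) throws IOException, TransformerException {
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer();
    //transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    //transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

    // Basic version: hand the raw OutputStream to the transformer, with no Writer in between.
    //transformer.transform(new DOMSource(doc), new StreamResult(new OutputStreamWriter(out, "UTF-8")));
    transformer.transform(new DOMSource(doc), new StreamResult(out));
}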
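
To check what the transformer actually writes, a sketch along these lines may help. The class and method names are hypothetical, and it sets the encoding explicitly (the variant that is commented out above) so the result does not depend on any default. It serializes into memory and decodes the bytes as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WriteUtf8Sketch {

    // Serialize the document to bytes, asking the transformer for UTF-8 output.
    public static byte[] toUtf8Bytes(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Build a tiny document in memory, so no file encoding is involved.
    public static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element kind = doc.createElement("kind");
        kind.setTextContent("Fonolog\u00eda"); // \u00ed is 'í'
        doc.appendChild(kind);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Decode with the same charset the transformer was told to use.
        String xml = new String(toUtf8Bytes(sample()), StandardCharsets.UTF_8);
        System.out.println(xml.contains("Fonolog\u00eda")); // true
    }
}
```

The check prints a boolean rather than the text itself, because printing the accented letter to a console would again depend on the console's encoding.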
Joop Eggen
  • I just noticed that element but it was not enough, the project already has that option selected on Encoding. – GunBlade Jul 16 '15 at 16:05
  • 1
    Project has no obvious faults (to my quick search). Found one printDocument switching any encoding to UTF-8, but I would try the basic version first. Also change `new byte['È']` to a serious `new byte[4096]`. – Joop Eggen Jul 17 '15 at 08:12
2

When you call getBytes() on a String, you will get the bytes in the underlying platform’s default encoding. When you use the String(byte[]) constructor, you will convert the bytes to a String using the platform’s default encoding.

When you combine these two as in

return new String(("<" + this.strName + ">" + strData + "</" + this.strName + ">").getBytes());

you are performing an obsolete conversion of a String to bytes and back to a String in the best case, i.e. if the platform’s default encoding can handle all characters, and are destroying information, if it can’t. Then, don’t be surprised to see ? instead of these characters.

There is a simple solution at this place, just remove that obsolete conversion:

return "<" + this.strName + ">" + strData + "</" + this.strName + ">";

Of course, now that these characters are not destroyed, they may cause problems at other places where you use the platform’s default encoding when UTF-8 is expected. You may search for all occurrences of conversions between Strings and byte[]s and ensure that all of them use the same encoding, preferably UTF-8, but you may also decide to remove these unnecessary conversions.

If the source is a String of characters, just process them as such:

return DocumentBuilderFactory.newInstance().newDocumentBuilder()
    .parse(new InputSource(new StringReader(returnFullTagData(strData))))
    .getDocumentElement();

no conversions, no data loss…
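
A self-contained sketch of that approach, with a hypothetical class and method name and a simplified signature (the original returnFullTagData helper is replaced by plain concatenation), could look like this:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class StringParseSketch {

    // Wrap the data in a tag and parse it straight from characters.
    // No byte[] is ever created, so no charset is involved.
    public static Element parseElement(String name, String data) throws Exception {
        String xml = "<" + name + ">" + data + "</" + name + ">";
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        // \u00ed is 'í'; the escape keeps this source file encoding-independent.
        Element kind = parseElement("kind", "Fonolog\u00eda");
        System.out.println(kind.getTextContent().equals("Fonolog\u00eda")); // true
    }
}
```

Because the parser reads from a Reader, the result is the same whatever the platform's default encoding is, in the IDE or in the build version.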

Holger
  • But, if i do that, i get Invalid byte 2 of a 3-byte UTF-8 sequence when the build version is executed. I know, i prefer that way, and of course, i'm trying to do it in that way – GunBlade Jul 16 '15 at 18:23
  • 1
    As said, you get invalid sequences because you are doing obsolete conversions between `String`s and `byte`s and mix up the platform’s encoding (which differs between the IDE and the build version) and `UTF-8`. If you remove *all* obsolete conversions, you remove the problem. You won’t solve the problem by adding more obsolete conversions. – Holger Jul 16 '15 at 18:37
-2

OK, thanks for all your help. It was really helpful for solving some problems, not the main one, but any improvement is really appreciated. The problem was the Guava library, but I don't know why. I just went back to my first version and deleted the library, and the release project started to work correctly, like debug mode. If someone can say why this happens, I'll be much more thankful.

GunBlade
  • 2
    You aren't using Guava anywhere in your example code. I guarantee the problem is that you are/were doing something wrong, not a problem with Guava: Guava consistently makes you choose the `Charset` you want to use for encoding/decoding text in all of its APIs that do that. – ColinD Jul 17 '15 at 15:12
  • I was using Guava just in the Join("") method, the ExcelReader was made with that option, but for some reason I prefer use Guava instead of for(a:arr) – GunBlade Jul 17 '15 at 15:23
  • 1
    `Joiner` does absolutely nothing related to encoding, just FYI. – ColinD Jul 17 '15 at 15:24
  • Well, that was the way that I fixed the problem so... you tell me. The problem was not about encoding, It was about some kind of problem with properties in the Build process. – GunBlade Jul 17 '15 at 15:25