Encoding issue in I/O with Jena

Question

I'm generating some RDF files with Jena. The whole application works with UTF-8 text, and the source code is also stored in UTF-8.

When I print a string containing non-English characters on the console, I get the right output, e.g. Est un lieu généralement officielle assis....

Then, I use the RDF writer to output the file:

Model m = loadMyModelWithMultipleLanguages();
log.info(getSomeStringFromModel(m)); // log4j, correct output
RDFWriter w = m.getWriter("RDF/XML"); // default encoding: UTF-8
w.setProperty("showXmlDeclaration", "true"); // optional
try (OutputStream out = new FileOutputStream(pathToFile)) {
    w.write(m, out, "http://someurl.org/base/");
}
// file contains garbled text

The RDF file starts with: <?xml version="1.0"?>. If I add utf-8 nothing changes.

By default the text should be encoded to utf-8. The resulting RDF file validates ok, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu g√©n√©ralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis.... (Either way, this is obviously not acceptable from the user's viewpoint). The same issue happens with any output format supported by Jena (RDF, NT, etc.).

I can't find a logical explanation for this, and the official documentation doesn't seem to address the issue.

Any hint or tests I can run to figure it out?

It could be that the file is written as utf-8, but vim and Firefox are reading it as some other encoding. Here's how you can specify your [output encoding in vim](http://stackoverflow.com/questions/778069/how-can-i-change-a-files-encoding-with-vim). — Kale McNaney, Oct 01 '12 at 18:53
Hmm, the Unicode escapes \u221A and \u00A9 represent [the square root symbol √](http://www.unicodemap.org/details/0x221A/index.html) and [the copyright symbol ©](http://www.unicodemap.org/details/0x00A9/index.html), respectively. The code point for [the e with acute - é](http://www.unicodemap.org/details/0x00E9/index.html) is \u00E9, so it does appear the file has been written incorrectly... — Kale McNaney, Oct 01 '12 at 19:07
For reference the latest Jena documentation is now at jena.apache.org - the specific documentation you refer to is at http://jena.apache.org/documentation/io/iohowto.html#character-encoding-issues — RobV, Oct 01 '12 at 20:24
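
One possible reading of the √ and © symptom noted in the comments above: the two UTF-8 bytes of é (0xC3 0xA9) are being decoded somewhere with a legacy single-byte charset. In Mac OS Roman, 0xC3 is √ and 0xA9 is ©. The sketch below is only an illustration of that hypothesis, not a diagnosis of the asker's setup. The class and method names are made up for this example, and the charset name "x-MacRoman" is assumed to be available in the running JVM (the code returns null if it is not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeCheck {
    // Encodes s as UTF-8, then decodes those bytes with the named (wrong) charset.
    // Returns null when this JVM does not provide that charset.
    static String decodeUtf8As(String s, String wrongCharset) {
        if (!Charset.isSupported(wrongCharset)) {
            return null;
        }
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        // "généralement", written with escapes so the source file encoding does not matter
        String word = "g\u00E9n\u00E9ralement";
        String garbled = decodeUtf8As(word, "x-MacRoman");
        if (garbled == null) {
            System.out.println("x-MacRoman is not available in this JVM");
        } else {
            // Print the code points rather than the text, so the console encoding cannot interfere
            garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
            System.out.println();
        }
    }
}
```

If the code points U+221A and U+00A9 appear where é should be, the text was UTF-8 that got decoded as Mac OS Roman at some point; where exactly that happens still has to be found by checking each stage.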

cygri · Accepted Answer · 2012-10-02T03:18:45.703

My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.

You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.

You're also not showing how you're printing the strings in the printStringFromModel() method.

Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?

Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or rule that out by checking the length() of one of the offending strings: if each troublesome character like é is counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
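
The length() check can be tried in isolation. The following is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and it simulates the suspected bug (UTF-8 bytes decoded as ISO-8859-1) rather than calling any Jena code.

```java
import java.nio.charset.StandardCharsets;

class LengthCheck {
    // Simulates a reader that wrongly assumes ISO-8859-1 for UTF-8 input.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "généralement", written with escapes so the source file encoding does not matter
        String good = "g\u00E9n\u00E9ralement";
        String bad = misread(good);

        System.out.println(good.length()); // 12: each é is one char
        System.out.println(bad.length());  // 14: each é became two chars
    }
}
```

In the real application, the same comparison would be made on a literal taken from the model: if its length() is larger than the number of characters you expect, the damage was done before the writer was ever called.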

I'm printing with log4j. The source strings are all stored in utf-8, and I can view them properly in any editor. When I change the encoding from utf-8 to other encodings in Firefox/vim, nothing changes. — Mulone, Oct 01 '12 at 20:56

score 1 · Answer 2 · edited May 23 '17 at 11:48

My hint/answer would be to inspect the byte sequence in 3 places:

1. The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected UTF-8 byte sequence 0xc3a9.
2. In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
3. The output file. Again, use a hex editor to confirm that the byte sequence is 0xc3a9.

This will tell you exactly what is happening to the bytes as they travel along the path of your program, and also where they deviate from the expected 0xc3a9.
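
If no hex editor is at hand, the same inspection can be sketched in plain Java. The helper names below are hypothetical and the code does not touch Jena; it only prints the UTF-8 bytes of a string, or the raw bytes of a file whose path is passed as the first argument.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class HexInspect {
    // Renders each byte as two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hex of the UTF-8 encoding of an in-memory string.
    static String utf8Hex(String s) {
        return toHex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Hex of the raw bytes stored in a file.
    static String fileHex(Path path) throws IOException {
        return toHex(Files.readAllBytes(path));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(utf8Hex("\u00E9"));       // c3a9: a correctly held é
        System.out.println(utf8Hex("\u00C3\u00A9")); // c383c2a9: é already misread as ISO-8859-1
        if (args.length > 0) {
            System.out.println(fileHex(Paths.get(args[0])));
        }
    }
}
```

A sequence such as c383c2a9 in the output file, where c3a9 was expected, points to text that was decoded with the wrong charset and then encoded again as UTF-8.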

score 1 · Answer 3 · answered Oct 01 '12 at 21:33

The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.

Ian Dickinson
