I'm trying to parse a UTF-16 encoded document using the Apache Camel Splitter with xtokenize, which delegates to Woodstox (com.ctc.wstx.sr.BasicStreamReader). I cannot know the encoding of a file before I read it; currently some files are UTF-16 and others are UTF-8:

.split().xtokenize(getToken(), 'w', NAMESPACES)

The problem I encounter is that Camel tells Woodstox which encoding to use:

String charset = IOHelper.getCharsetName(exchange);

It falls back to UTF-8 as the default encoding, so BasicStreamReader tries to read the BOM bytes as UTF-8 and fails with

com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '�' (code 65533 / 0xfffd) in prolog; expected '<'
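The character code in that message can be reproduced with the JDK alone. The following is a minimal sketch, not Camel or Woodstox code: the class and method names are made up for illustration, and it only shows what happens when the BOM of a UTF-16 document is decoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class BomAsUtf8Demo {

    // Hypothetical helper: decodes the bytes as UTF-8 and returns the first character.
    static char firstCharDecodedAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).charAt(0);
    }

    public static void main(String[] args) {
        // Java's "UTF-16" encoder writes a big-endian BOM (FE FF) before the content.
        byte[] utf16 = "<a/>".getBytes(StandardCharsets.UTF_16);

        // FE and FF are never valid in UTF-8, so the decoder substitutes U+FFFD.
        char first = firstCharDecodedAsUtf8(utf16);
        System.out.printf("first char: U+%04X%n", (int) first); // prints "first char: U+FFFD"
    }
}
```

U+FFFD is decimal 65533, the same code the parser reports where it expected '<'.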

As specified in https://www.w3.org/TR/xml/#sec-guessing, an XML parser (here Woodstox) should be able to autodetect the file encoding, if only Camel let it do the work.

Is there a way to avoid implementing the encoding detection myself?

antidote

2 Answers

Okay, I can see that the current source code falls back to the platform encoding. So your use case, with the encoding provided in the XML declaration, is not supported.

I am not sure Camel really needs to fall back to a default platform encoding, as it uses java.util.Scanner in the splitter, which supports scanning without a specific encoding.

Maybe you can try to patch the source code in XMLTokenExpressionIterator, test it locally, and report back here.

We can then likely take a look at making the fallback encoding optional in Apache Camel.

In your current version of Apache Camel you can always extend XMLTokenExpressionIterator, override the doEvaluate method, and call the createIterator method without a charset parameter. Then use your custom iterator with the Camel splitter.

Claus Ibsen
  • the fallback is not platform encoding: it's: String DEFAULT_CHARSET_PROPERTY = "org.apache.camel.default.charset"; when defined, otherwise UTF-8, so it's UTF-8 by default. But this is better than relying on "random" platform encoding – antidote Sep 24 '17 at 11:13
  • I could pinpoint the issue: Camel's XmlTokenExpressionIterator uses a Reader (RecordableReader), thus saying it knows the charset of the input. So Woodstox (for example) would use ReaderBootstrapper as opposed to StreamBootstrapper. See javadoc of ReaderBootstrapper: `Input bootstrap class used when input comes from a Reader; in this case, encoding is already known, and thus encoding from XML declaration (if any) is only double-checked, not really used.` StreamBootstrapper is able to recognize declared encoding. – antidote Sep 26 '17 at 09:33
  • I've patched `XMLTokenExpressionIterator` easily replacing `RecordableReader` with `RecordableInputStream` from the same package. createXMLStreamReader(in) is deprecated, the other method wants to have the exchange to choose a charset, but you can fix it easily. Now it fails in the next step inside XmlConverter, where I apply xslt to extracted string for the same reason: `XMLStreamReader r = new StaxConverter().createXMLStreamReader(new StringReader(source))` It does not know it's UTF-16. I think the detected encoding should be set on EXCHANGE for the next steps by Camel. – antidote Sep 26 '17 at 13:42
  • I'm going to create a bug report for it, it needs to be fixed in several places... – antidote Sep 26 '17 at 14:06
  • hmmm there is a comment: `// woodstox's getLocation().etCharOffset() does not return the offset correctly for InputStream, so use Reader instead. ` which says that InputStream does not work? – antidote Sep 26 '17 at 18:25

I created a Camel JIRA ticket: https://issues.apache.org/jira/browse/CAMEL-11846. As my comments show, there is no easy way to split UTF-16 XML with Camel without knowing in advance that it is UTF-16.

Subclassing XMLTokenExpressionIterator (which is an ExpressionAdapter) and switching to InputStream works at first, but there are several other places, involving XSLT, XPath, and conversion to StaxSource, where it breaks for the same reason.

As a workaround, I think it's easier to let XmlStreamReader find out the encoding in advance (this happens at initialization) and then set the Exchange.CHARSET_NAME header or property.
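The idea of detecting the encoding before the splitter runs can be sketched with the JDK alone. This is not the XmlStreamReader class mentioned above: it is a rough illustration under stated assumptions. The class and method names (XmlEncodingSniffer, charsetFromPrefix, sniff) are hypothetical, and only UTF-8 and UTF-16 are distinguished, based on the leading bytes described in the XML specification's encoding-detection appendix.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class XmlEncodingSniffer {

    // Hypothetical helper: maps the first bytes of an XML file to a Java charset name.
    // Only UTF-8 and UTF-16 are distinguished; UCS-4 and EBCDIC are ignored in this sketch.
    static String charsetFromPrefix(byte[] head, int len) {
        int b0 = len > 0 ? head[0] & 0xFF : -1;
        int b1 = len > 1 ? head[1] & 0xFF : -1;
        int b2 = len > 2 ? head[2] & 0xFF : -1;
        int b3 = len > 3 ? head[3] & 0xFF : -1;

        // BOM present: Java's "UTF-16" decoder reads the BOM and picks the byte order itself.
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
            return "UTF-16";
        }
        // No BOM, but "<?" stored in two-byte units.
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";

        // Everything else is treated as UTF-8 here.
        return "UTF-8";
    }

    // Peeks at up to four bytes and pushes them back so the stream can still be read in full.
    // The stream must have been created with a pushback buffer of at least 4 bytes.
    static String sniff(PushbackInputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before 4 bytes were available
            n += r;
        }
        if (n > 0) in.unread(head, 0, n);
        return charsetFromPrefix(head, n);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<?xml version=\"1.0\"?><a/>".getBytes(StandardCharsets.UTF_16);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(doc), 4);

        String charset = sniff(in);
        // In a Camel route this value would be stored under Exchange.CHARSET_NAME
        // before the splitter runs; that wiring is not shown here.
        System.out.println(charset);
    }
}
```

A sketch like this ignores the encoding attribute of the XML declaration, so a file declared in some other single-byte encoding would still be reported as UTF-8.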

antidote