
We're encountering a character encoding issue when reading a UTF-8 query string. A separate, outside application constructs links to our Orbeon application such as:

  • http://localhost:8080/ops/encoding-test/?message=hello%20world
  • http://localhost:8080/ops/encoding-test/?message=it%E2%80%99s%20a%20message

Our application's model reads the query string with the oxf:request processor and then displays the string in a view. In the first case above, the application displays "hello world" correctly. In the second case, %E2%80%99 is the percent-encoded UTF-8 form of a typographic apostrophe (U+2019), and it causes the application to fail with:

2012-09-13 12:21:43,383 ERROR XSLTTransformer  - Error at line 174 of oxf:/config/theme-examples.xsl:
Illegal HTML character: decimal 128
2012-09-13 12:21:43,384 ERROR ProcessorService  - Exception at line 174 of oxf:/config/theme-examples.xsl
; SystemID: oxf:/config/theme-examples.xsl; Line#: 174; Column#: -1
org.orbeon.saxon.trans.XPathException: Illegal HTML character: decimal 128

The error refers to the %80, which is the second byte of the three-byte UTF-8 encoding of the apostrophe (0x80 is decimal 128). Note that, according to the log, not only the theme raises an exception but the XForms inspector does as well.

It appears that the URL is being decoded as Latin-1 (ISO-8859-1) instead of UTF-8: the debug processor lists "it???s a message", with three characters in place of the apostrophe. In my research so far, HTTP doesn't appear to have a way to specify the encoding of the query string itself.
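To make the suspected mis-decoding concrete, here is a minimal sketch in Python (used purely for illustration; it is not part of Orbeon or the servlet stack) that decodes the second test URL's query value once as Latin-1 and once as UTF-8. The variable names are hypothetical.

```python
from urllib.parse import unquote

# The percent-encoded value from the second test link.
raw = "it%E2%80%99s%20a%20message"

# Decoding each escaped byte as its own Latin-1 character yields three
# characters where the apostrophe should be.
as_latin1 = unquote(raw, encoding="latin-1")

# Decoding the same bytes as UTF-8 yields a single U+2019 character.
as_utf8 = unquote(raw, encoding="utf-8")

print([ord(c) for c in as_latin1[2:5]])  # [226, 128, 153]
print(len(as_latin1) - len(as_utf8))     # 2
```

The middle code point, 128, matches the "Illegal HTML character: decimal 128" in the log, which is consistent with the Latin-1 theory.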

  1. Is there a way to specify the encoding of a query string when read with oxf:request? I didn't see a configuration property for the processor or anything relevant in properties-local.xml that would set a default.
  2. If not, is there a way to force the associated encoding of the string? I suspect this could be done with XSLT, but was unable to find an example. I believe I want something equivalent to Ruby's String#force_encoding.
  3. If not, is there any other suggested way to work around the error? My current worst-case hack-fix here is to just strip out any offending characters using mod_rewrite before it hits the servlet.
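For reference, the "force the encoding" idea in question 2 can be sketched as follows. This is Python for illustration only, not XSLT or an Orbeon API, and the function name `reinterpret_latin1_as_utf8` is hypothetical. It assumes the string was produced by decoding UTF-8 bytes as Latin-1, so mapping each character back to its byte and decoding again as UTF-8 recovers the original text.

```python
def reinterpret_latin1_as_utf8(text: str) -> str:
    """Undo a Latin-1 mis-decoding of UTF-8 bytes, if that is what happened."""
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
        # recovers the original byte sequence before decoding it as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mis-decoded UTF-8 (or not only that); leave it alone.
        return text

# The string as it would look after the suspected Latin-1 decoding.
mangled = "it\u00e2\u0080\u0099s a message"
print(reinterpret_latin1_as_utf8(mangled) == "it\u2019s a message")  # True
```

A repair like this is a guess about what went wrong upstream; fixing the decoding where the query string is first parsed avoids the guesswork.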

Any guidance and assistance is appreciated!

(cross posted to ops-users mailing list at http://mail-archive.ow2.org/ops-users/2012-09/msg00033.html)

Gabe Martin-Dempesy
  • For what it's worth, [RFC 3987](http://www.ietf.org/rfc/rfc3987.txt) specifies that for IRIs, the percent-encoding should represent the UTF-8 form of the character, so your outside application is doing at least a plausible thing. Prior to [RFC 3986](http://www.ietf.org/rfc/rfc3986.txt), however, the definitions of URI did not specify in detail what character encoding should be used for non-ASCII data. In practice, software often uses the HTML page encoding or HTTP headers to guess. Use oxf:request to find out what the `accept-charset` header says. Can you reconfigure the requestor? – C. M. Sperberg-McQueen Sep 13 '12 at 22:41

1 Answer


Orbeon Forms relies on what is returned by the servlet API: see getParameterMap() in ServletExternalContext. So this seems to be something you need to set at the application server level; if using Tomcat, you can do so by adding URIEncoding="UTF-8" on the <Connector>.

avernet
  • Adding the `URIEncoding` attribute in Tomcat's `conf/server.xml` solved this problem, as does the `useBodyEncodingForURI` attribute. Both attributes are documented at http://tomcat.apache.org/tomcat-7.0-doc/config/ajp.html and covered in a FAQ at http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q2 – Gabe Martin-Dempesy Sep 17 '12 at 16:41