2

I've been using, for example, the degree character entity `&deg;` in my source XML, and it was always output as `&deg;` after transformation and worked fine. However, I recently had to switch from the Xalan processor to Saxon, and now the character is output as an actual degree character (°) in the HTML, and the browser renders it as ¬∞.

I'm not sure why it worked in Xalan, but after searching around I thought character maps would be the solution, based on this page:

http://www.xmlplease.com/xmltraining/xslt-by-example/examples/character-map_1.html

But when I do the same thing it just appears to be ignored and I still see the °.

Again, I'm using Saxon 9 with the xslt task in Ant on Java 6. I'd like the `&deg;` entity in my XML to be preserved (or changed to a numeric character reference such as `&#176;`) when transforming to HTML. Any suggestions?

rjcarr
  • Preserving character entities and general entities is a major pain. I also posted this question a while back: http://stackoverflow.com/questions/5985615/preserving-entity-references-when-transforming-xml-with-xslt Good luck! – Daniel Haley Oct 25 '11 at 07:04
  • DevNull: Yeah, I saw your post, but it was on text entities and I'm just dealing with characters. You say you've got character entities to be preserved using character maps? Is there anything in the link I gave that is missing because I tried that and it didn't work. – rjcarr Oct 25 '11 at 07:18

2 Answers

3

It looks like the new output is not marked as UTF-8?

Most often, when one character becomes two, it's because you are sending UTF-8 to the browser while declaring a different encoding (e.g. ISO-8859-1, windows-1252, etc.). Declaring UTF-8 in the HTML head may not be enough. You probably also need to send it as a header in the HTTP response.
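To make the "one character becomes two" effect concrete, here is a minimal Java sketch. It is only an illustration, not the poster's setup: the class and method names are made up, and ISO-8859-1 stands in for whatever encoding the browser wrongly assumed. Under ISO-8859-1 the two UTF-8 bytes of the degree sign come out as Â°. The ¬∞ pair reported in the question looks like the same two bytes read as Mac OS Roman, but that is an inference, not something the poster confirmed.

```java
import java.nio.charset.Charset;

public class MojibakeDemo {
    // Encode text as UTF-8, then decode those same bytes with another charset.
    // This mimics a browser that was told the wrong encoding for a UTF-8 page.
    static String misdecode(String text, String assumedCharset) {
        byte[] utf8Bytes = text.getBytes(Charset.forName("UTF-8"));
        return new String(utf8Bytes, Charset.forName(assumedCharset));
    }

    public static void main(String[] args) {
        // The degree sign, written as an escape so the encoding of this
        // source file cannot affect the result.
        String degree = "\u00B0";
        String garbled = misdecode(degree, "ISO-8859-1");

        System.out.println(degree.length() + " char in, " + garbled.length() + " chars out");
        for (char c : garbled.toCharArray()) {
            // Prints U+00C2 then U+00B0: the two UTF-8 bytes, each shown as its own character.
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Decoding with the right charset (`misdecode(degree, "UTF-8")`) returns the original single character, which is why fixing the declared encoding cures the symptom without touching the XSLT.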

Using `&deg;` in the source should not help if the XSLT processor expands all the entities when it parses the input.

Otherwise, there may be a flag you can set to avoid the translation of entities?

Alexis Wilke
  • Thanks for the response, I try to keep utf-8 throughout, but I'm not 100% sure. At the top of the page I have ` ` and then I have `` in the head. I can't tell for sure what is in the response though, I'll work on that. – rjcarr Oct 25 '11 at 06:54
  • Also, the browser is reporting the encoding as UTF-8. I changed to an html5 doctype ` ` and it didn't make a difference. And just to be clear, if I view the source of the html it shows the degree character (°) but shows in the rendered html as ¬∞. – rjcarr Oct 25 '11 at 07:05
  • Turns out the problem was related to encoding. Thanks for helping me out! – rjcarr Oct 25 '11 at 08:05
  • @rjcarr Can you explain what the problem was please? How was it related to encoding? I think adding that detail will help others who find this answer. – Jeff Yates Nov 05 '12 at 20:44
  • @Jeff, what I meant was the HTTP response header: `Content-Type: text/html; charset=utf-8` so that you enforce the encoding directly in the HTTP response. Whether rjcarr fixed it that way, I do not know, but if that HTTP charset was set to, say, ISO-8859-1, then the browser would not use UTF-8 anyway. – Alexis Wilke Nov 06 '12 at 06:30
  • @AlexisWilke Thanks, I understood what you meant. I'm intrigued by what actually fixed the issue which, unfortunately, the OP did not choose to share (at least not unambiguously). – Jeff Yates Nov 06 '12 at 14:16
  • @JeffYates Sorry, I don't remember the details, but I think it was a combination of changing the doctype as well as the charset in all of the pages. – rjcarr Nov 07 '12 at 23:50
2

You can't force the input entities to be preserved, but you can ensure that any non-ASCII characters are output as entity or character references by using `xsl:output` with `encoding="us-ascii"`.
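A rough way to try the us-ascii suggestion outside Ant is the Java sketch below. It assumes the standard JAXP API and uses whichever XSLT processor `TransformerFactory.newInstance()` finds (the JDK's built-in one unless Saxon is on the classpath), so the exact reference form may differ between processors. The stylesheet, the input document and the class name are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AsciiOutputDemo {
    // Hypothetical stylesheet: HTML output method, serialised as US-ASCII.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' encoding='us-ascii'/>"
      + "<xsl:template match='/temp'><p><xsl:value-of select='.'/></p></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical input: the parser expands &#176; to a real degree character.
    static final String XML = "<temp>20&#176;C</temp>";

    static String transform() {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            // Write to a byte stream so the processor's own serialiser applies the encoding.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
            return out.toString("US-ASCII");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The degree sign should appear as a reference such as &deg; or &#176;,
        // never as a raw non-ASCII character.
        System.out.println(transform());
    }
}
```

Because every byte in the result is plain ASCII, the page displays correctly whatever encoding the browser guesses, which is why this works as a workaround even when the declared encoding is wrong.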

The fact that your browser doesn't display the degree sign correctly means that the document is being served with the wrong encoding. Using us-ascii is a workaround for this, but it doesn't solve the underlying problem which is that there's something wrong in your configuration somewhere (it can be hard to find out where).

I don't know why your character maps are ignored. Assuming you've coded it correctly, the most likely reason is that the serialisation isn't being done by the XSLT processor but by something else: for example, you might be transforming to a DOM and then serialising the DOM.
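The DOM scenario can be reproduced with a short Java sketch. This is an assumption about what might be happening, not a diagnosis of the poster's Ant setup. It uses standard JAXP calls only, and the stylesheet, input and class name are made up. The stylesheet asks for us-ascii output, but because the result goes to a `DOMResult`, that request is never applied. A second, identity `Transformer` with its own output properties does the serialising.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DomDetourDemo {
    // Hypothetical stylesheet that requests ASCII-only HTML output.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' encoding='us-ascii'/>"
      + "<xsl:template match='/temp'><p><xsl:value-of select='.'/></p></xsl:template>"
      + "</xsl:stylesheet>";
    static final String XML = "<temp>20&#176;C</temp>";

    static String viaDom() {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();

            // Step 1: transform into a DOM tree. Nothing is serialised here,
            // so the xsl:output settings have nothing to act on.
            Transformer styled = factory.newTransformer(new StreamSource(new StringReader(XSLT)));
            DOMResult dom = new DOMResult();
            styled.transform(new StreamSource(new StringReader(XML)), dom);

            // Step 2: a separate identity transformer serialises the tree.
            // Its properties, not the stylesheet's, decide how characters are written.
            Transformer identity = factory.newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new DOMSource(dom.getNode()), new StreamResult(out));
            return out.toString("UTF-8");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The raw degree character survives even though the stylesheet asked for us-ascii.
        System.out.println(viaDom());
    }
}
```

If a build behaves like step 2, the place to set the encoding (or any character map) is on whatever component does the final serialisation, not in the stylesheet.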

You can get more control over how Saxon serialises special characters with the HTML output method using saxon:character-representation - see http://saxonica.com/documentation/extensions/output-extras/character-representation.xml

Michael Kay
  • Thanks for the response. It turns out the maps weren't being ignored but just not preserving the entity, i.e., it was still writing the character in the html and not the entity. Anyway, you are right, the problem is with my encoding. I'm putting a bunch of different pages together and it turns out not all of them were correctly encoded (but as you said, it was hard to tell). Sorry, the previous answer also said this, so I have to give him credit. Thanks so much for the time, though! – rjcarr Oct 25 '11 at 08:04
  • Don't worry, I'm not one of the people who cares about earning brownie points on this site. In fact, I regard all the privileges and points as absurdly childish. – Michael Kay Oct 25 '11 at 22:38