19

I have an application which, like many others, takes user input, stores it in a database, and later processes it using (among other things) XML tools. The application accepts free-text input, and like many other developers I am very careful with escaping and quoting so that it can handle input containing different types of whitespace, quote characters, reserved XML characters, etc.

However, occasionally a user will manage to enter a string containing a vertical tab character (hex 0B) or a form feed (hex 0C). These characters cannot be processed by XML tools at all, and they cause the app to barf.

In my application it's quite important to preserve the original input during the 'round trip' process, so I'm loath to just strip out any characters I don't like, especially things like form feed, which are still occasionally used in plain text files.

Is there any accepted best practice or general strategy for handling these characters when XML processing is involved?

Andy

2 Answers

4

Yes, unfortunately some characters are illegal in XML and have no entity equivalent. For one example, see:

http://www.jdom.org/docs/apidocs.1.1/org/jdom/Element.html#setText(java.lang.String)

which is a String setter... that can throw an exception! Vertical tab is exactly one of those characters for which there is no XML entity, nor a way to "escape" it with XML alone.
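To make the failure concrete, here is a minimal Python sketch. Python and the helper names `is_xml10_safe` and `parses` are my own choices for illustration, not something from the question. It checks a string against the XML 1.0 `Char` production and shows that the standard-library parser rejects a vertical tab whether it appears raw or as a character reference.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml10_safe(text: str) -> bool:
    """Return True if every character in text is legal in an XML 1.0 document."""
    return _INVALID_XML10.search(text) is None


def parses(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_xml10_safe("plain text\twith a tab"))
    print(is_xml10_safe("vertical tab \x0b here"))
    print(parses("<a>\x0b</a>"))      # raw vertical tab in the document
    print(parses("<a>&#x0B;</a>"))    # the same character as a reference
```

A check like `is_xml10_safe` could run at input time to flag strings that need special handling before they reach any XML tooling.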

I'm working around this myself by using base64 encoding to sanitize strings that might harbor those characters. It's a bit silly, since I have to base64-encode and decode all the time, but I don't think there's a good alternative.
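The base64 workaround might look roughly like the following Python sketch. The `encoding="base64"` attribute and the `wrap_text` / `unwrap_text` helpers are hypothetical names chosen for this example, not part of any library; the idea is only that the XML layer ever sees plain ASCII, while the original string is recovered byte for byte on the way out.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_text(tag: str, text: str) -> ET.Element:
    """Store text as base64 so the element content is always XML-safe ASCII."""
    # The attribute name "encoding" is an assumption made for this sketch.
    elem = ET.Element(tag, {"encoding": "base64"})
    elem.text = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return elem


def unwrap_text(elem: ET.Element) -> str:
    """Reverse wrap_text; elements without the marker are returned as-is."""
    if elem.get("encoding") == "base64":
        return base64.b64decode(elem.text or "").decode("utf-8")
    return elem.text or ""


# Round trip a string containing a form feed and a vertical tab.
original = "page one\x0cpage two\x0bend"
xml_bytes = ET.tostring(wrap_text("note", original))
restored = unwrap_text(ET.fromstring(xml_bytes))
```

The trade-off is that the stored text is no longer human-readable or searchable inside the XML, which is the "bit silly" part.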

dyoo
-4

You should escape them as ampersand-style numeric character references (&#x00; through &#x1F;), then decode/restore them at the end.

See XmlTextWriter incorrectly writing control characters
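One caveat: XML 1.0 parsers reject character references to these code points, so the escape-then-restore idea only survives a round trip if the escaping happens at the application level rather than with XML references. The Python sketch below swaps in a backslash-based scheme of my own invention (`\uXXXX` for illegal characters, `\\` for a literal backslash); the function names are hypothetical and this is one possible convention, not an established standard.

```python
import re
import xml.etree.ElementTree as ET

# Match either a character that is illegal in XML 1.0 or a literal backslash.
_NEEDS_ESCAPE = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]|\\"
)
# Match the two escape forms produced below: \\ and \uXXXX.
_ESCAPED = re.compile(r"\\(\\|u[0-9a-f]{4})")


def escape_for_xml(text: str) -> str:
    """Replace XML-illegal characters with \\uXXXX and double any backslash."""
    def repl(match):
        ch = match.group(0)
        return "\\\\" if ch == "\\" else "\\u%04x" % ord(ch)
    return _NEEDS_ESCAPE.sub(repl, text)


def unescape_from_xml(text: str) -> str:
    """Invert escape_for_xml."""
    def repl(match):
        body = match.group(1)
        return "\\" if body == "\\" else chr(int(body[1:], 16))
    return _ESCAPED.sub(repl, text)


# Round trip through a real serialize/parse cycle.
original = 'vt \x0b, ff \x0c, literal \\u000b, <tag> & "quotes"'
elem = ET.Element("note")
elem.text = escape_for_xml(original)
restored = unescape_from_xml(ET.fromstring(ET.tostring(elem)).text)
```

Because backslashes are doubled on the way in, text that already contains something like `\u000b` is not confused with an escaped vertical tab. Every consumer of the XML has to know about the convention, though, which is the main cost compared with base64.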

Vincent
  • Then the question makes no sense. If the requirement is to put special invalid characters in the XML (however invalid that may be), escaping will still allow the file to be processed, while the edge case of using invalid characters has to be handled by the application itself. You could also use CDATA or any other format. – Vincent Oct 17 '13 at 15:20
  • 4
    Indeed, the question makes no sense. It's another case where the developer is being asked to make up for the fact that the people sending the data don't understand XML. – John Saunders Oct 17 '13 at 17:43
  • 1
    @Vincent There are certain characters that are not allowed in XML documents _even if_ they are escaped as entities. OP mentioned two such characters. `&#x0B;` is not valid XML. – JLRishe Feb 02 '15 at 18:54