When a form in my app is submitted, it is converted (on the client-side) to a string of HTML that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head> 
  <style type="text/css">
    td { white-space: normal; }
  </style>
</head>
<body>
<table>
    <tbody>
    <!-- Repeat for every field in the form -->
    <tr>
        <td>Name</td>
        <td>John Doe</td>
    </tr>
    </tbody>
</table>
</body>
</html>

As part of the conversion process, each field value is sanitized (by Angular's $sanitize service) to remove any <script> tags, etc.

On the server, I normalize/clean the HTML, then use the Flying Saucer Java library to convert this XHTML/CSS to a PDF.

To test the form, I have a tool that populates the fields with random values. It frequently generates unusual Unicode characters that cause the PDF converter to fail because they are not valid XML characters.

One such value is described below:

How the value appears when inspected in the browser

> $('input[name="postcode"]').val();
< "h5    9gx"

> encodeURI($('input[name="postcode"]').val());
< "h5%E2%80%82%0B%E2%80%A9%E2%80%89%E2%80%A9%E2%80%82%E2%80%88%0B9gx"

In the browser it looks like "h5" and "9gx" separated by a few spaces, but the separators are definitely not ordinary spaces.
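
One way to see what those separators actually are is to list the code point of every character. The following is a minimal sketch in plain JavaScript; the helper name `describeCodePoints` is made up for illustration, and the input is simply the URI-encoded value shown above, decoded again.

```javascript
// Hypothetical helper: list each character of a string as a U+XXXX code point.
function describeCodePoints(value) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(value, function (ch) {
    var hex = ch.codePointAt(0).toString(16).toUpperCase();
    return 'U+' + hex.padStart(4, '0');
  });
}

// The postcode value from above, rebuilt from its URI-encoded form.
var postcode = decodeURI(
  'h5%E2%80%82%0B%E2%80%A9%E2%80%89%E2%80%A9%E2%80%82%E2%80%88%0B9gx'
);

console.log(describeCodePoints(postcode).join(' '));
```

The characters between "h5" and "9gx" come out as U+2002, U+000B, U+2029, U+2009, U+2029, U+2002, U+2008 and U+000B, which matches the numeric entities in the raw HTML on the server.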

How the value appears when inspected on the server

Raw HTML value

<td>h5&#8194;&#11;&#8233;&#8201;&#8233;&#8194;&#8200;&#11;9gx</td>

After normalizing/cleaning the HTML it looks like the XML entities in the raw HTML have been converted to spaces, but again, they're definitely not spaces.

Whatever they are, they cause the XML parser to throw this exception:

SAXParseException; An invalid XML character (Unicode: 0xb) was found in the element content of the document.

How can I safely remove/replace/sanitize/encode these values either on the client or server-side?

Dónal

1 Answer

0xB (vertical tab) is not an allowed character in XML 1.0, which defines its legal characters as:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Therefore your data is not XML, and any conformant XML processor must report an error such as the one you received.
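
To make the production concrete, here is a small sketch that transcribes it into a JavaScript predicate on a code point. The function name `isXmlChar` is only illustrative and does not come from any library.

```javascript
// Sketch of the XML 1.0 Char production as a predicate on a code point.
// isXmlChar is an illustrative name, not part of any library.
function isXmlChar(codePoint) {
  return codePoint === 0x9 ||
         codePoint === 0xA ||
         codePoint === 0xD ||
         (codePoint >= 0x20 && codePoint <= 0xD7FF) ||
         (codePoint >= 0xE000 && codePoint <= 0xFFFD) ||
         (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
}

console.log(isXmlChar(0xB));    // false: vertical tab
console.log(isXmlChar(0x2002)); // true: en space
console.log(isXmlChar(0x2029)); // true: paragraph separator
```

Applied to the value in the question, only the two vertical tabs are rejected; the other separators are unusual whitespace but legal XML characters.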

You must repair the data before passing it to any XML library: treat it as plain text, not XML, and remove the illegal characters, either manually or automatically.
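
As a rough sketch of the automatic route on the client side, the following JavaScript strips illegal characters from a plain string, and also drops numeric character references such as `&#11;` that point at an illegal character, since the raw HTML in the question carries the value in that form. It assumes an environment that supports the ES2015 `u` regex flag and `String.fromCodePoint`; the names `INVALID_XML_CHARS`, `stripInvalidXmlChars` and `stripInvalidXmlCharRefs` are made up for this example.

```javascript
// Matches anything outside the XML 1.0 Char production. The "u" flag makes the
// pattern work on code points, so lone surrogates match and valid pairs do not.
var INVALID_XML_CHARS =
  /[^\u{9}\u{A}\u{D}\u{20}-\u{D7FF}\u{E000}-\u{FFFD}\u{10000}-\u{10FFFF}]/gu;

// Remove illegal characters from plain text (for example a field value).
function stripInvalidXmlChars(text) {
  return text.replace(INVALID_XML_CHARS, '');
}

// Remove numeric character references (&#11; or &#xB;) whose target is illegal.
// Legal references are left untouched.
function stripInvalidXmlCharRefs(markup) {
  return markup.replace(/&#(x[0-9a-fA-F]+|[0-9]+);/g, function (ref, body) {
    var codePoint = body.charAt(0) === 'x'
      ? parseInt(body.slice(1), 16)
      : parseInt(body, 10);
    if (codePoint > 0x10FFFF) {
      return ''; // beyond the Unicode range, never a legal reference
    }
    var ch = String.fromCodePoint(codePoint);
    return stripInvalidXmlChars(ch) === '' ? '' : ref;
  });
}

// Usage with a shortened version of the value from the question.
console.log(stripInvalidXmlChars('h5\u2002\u000B\u20299gx'));
console.log(stripInvalidXmlCharRefs('<td>h5&#8194;&#11;&#8233;9gx</td>'));
```

Whether to delete the offending characters or substitute something like a single space depends on the field; for the substitution variant, the replacement string `''` would be swapped for `' '`.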

See also:

kjhughes
  • Thanks for your answer, do you have any suggestion for how I might do this with JS/Angular on the client-side or any Java lib on the server? – Dónal May 22 '17 at 20:16
  • @Dónal: Answer updated with links to other Q/As showing how to filter illegal XML characters from strings via Java or JavaScript. – kjhughes May 22 '17 at 20:42