When a form in my app is submitted, it is converted (on the client side) to a string of XHTML that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<style type="text/css">
td { white-space: normal; }
</style>
</head>
<body>
<table>
<tbody>
<!-- Repeat for every field in the form -->
<tr>
<td>Name</td>
<td>John Doe</td>
</tr>
</tbody>
</table>
</body>
</html>
As part of the conversion process, each field value is sanitized (by Angular's $sanitize service) to remove any <script> tags, etc.
On the server, I normalize/clean the HTML, then use the Flying Saucer Java library to convert this XML/CSS to a PDF.
To test the form, I have a tool that bootstraps the fields with random values. This tool frequently produces values containing unusual Unicode characters that cause the PDF converter to fail, because they are not considered valid XML characters.
One such value is described below:
How the value appears when inspected in the browser
> $('input[name="postcode"]').val();
< "h5 9gx"
> encodeURI($('input[name="postcode"]').val());
< "h5%E2%80%82%0B%E2%80%A9%E2%80%89%E2%80%A9%E2%80%82%E2%80%88%0B9gx"
In the browser it looks like "h5" and "9gx" separated by a few spaces, but they are definitely not spaces.
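To see exactly which characters sit between "h5" and "9gx", the percent-encoded string from the console can be decoded and listed code point by code point. The following is only a rough sketch: the class name PostcodeCodePoints and its members are hypothetical, and java.net.URLDecoder is used purely as a convenient percent-decoder (it also treats "+" as a space, which does not matter for this input).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the app: lists the code points of a percent-encoded value.
final class PostcodeCodePoints {
    // Copied verbatim from the encodeURI(...) output in the browser console.
    static final String ENCODED =
        "h5%E2%80%82%0B%E2%80%A9%E2%80%89%E2%80%A9%E2%80%82%E2%80%88%0B9gx";

    // Decodes the UTF-8 percent escapes and formats each code point as U+XXXX.
    static List<String> codePoints(String encoded) {
        final String decoded;
        try {
            decoded = URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
        return decoded.codePoints()
            .mapToObj(cp -> String.format("U+%04X", cp))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(codePoints(ENCODED));
    }
}
```

For this value, the code points between the visible letters come out as U+2002, U+000B, U+2029, U+2009, U+2029, U+2002, U+2008, U+000B, i.e. a mix of Unicode space characters, paragraph separators, and vertical tabs.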
How the value appears when inspected on the server
Raw HTML value
<td>h5 
 
  9gx</td>
After normalizing/cleaning the HTML, it looks like the XML entities in the raw HTML have been converted to spaces, but again, they're definitely not spaces.
Whatever they are, they cause the XML parser to throw this exception:
SAXParseException; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
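The exception message is consistent with the parser enforcing the Char production of XML 1.0, which allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. As a diagnostic sketch under that assumption (the class Xml10CharCheck and its method names are hypothetical, not part of Flying Saucer or any XML library), the offending code points in a value can be listed like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper: reports code points that XML 1.0 does not allow.
final class Xml10CharCheck {
    // True if the code point matches the Char production of the XML 1.0 specification.
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns every disallowed code point in the text, formatted as U+XXXX, in order of appearance.
    static List<String> invalidCodePoints(String text) {
        List<String> invalid = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !isValidXml10Char(cp))
            .forEach(cp -> invalid.add(String.format("U+%04X", cp)));
        return invalid;
    }

    public static void main(String[] args) {
        // Same characters as the postcode value above, written as escapes.
        String postcode = "h5\u2002\u000B\u2029\u2009\u2029\u2002\u2008\u000B9gx";
        System.out.println(invalidCodePoints(postcode));
    }
}
```

Under this check, only the two U+000B (vertical tab) characters in the postcode value are rejected, which matches the "Unicode: 0xb" in the exception; the other space-like characters (U+2002, U+2008, U+2009, U+2029) fall inside the allowed ranges.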
How can I safely remove/replace/sanitize/encode these values either on the client or server-side?