I ran into the following exception while parsing XML generated from inputs:
org.xml.sax.SAXParseException: Character reference "&#
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
I traced the problem down to an input string containing the character 0x1F, an invisible "UNIT SEPARATOR" control character: http://www.columbia.edu/kermit/ascii.html
I had to copy the input into a text file to make the character visible.
I tested the same input string in other places and ran into similar problems, for example:
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: XML parsing: line 1, character 149, illegal xml character
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)
What would be the best way to strip such characters from an input string? Are there other problematic characters for XML that should be removed?
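For reference, here is a minimal sketch of the kind of filter I have in mind, assuming the target is XML 1.0. As far as I can tell, the XML 1.0 `Char` production only allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so 0x1F and the other C0 control characters fall outside it. The class name `XmlCharFilter` and its method names are made up for illustration; they are not part of any library.

```java
// Hypothetical helper (names are illustrative, not from any library).
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every code point outside that set removed. */
    public static String stripInvalidXmlChars(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        // Walk by code point so supplementary characters are kept intact
        // and unpaired surrogates (which are not valid XML chars) are dropped.
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isValidXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "foo" + (char) 0x1F + "bar"; // contains UNIT SEPARATOR
        System.out.println(stripInvalidXmlChars(dirty)); // prints "foobar"
    }
}
```

This silently drops the offending characters; replacing them with a space or a placeholder instead would be a one-line change in the loop. It only covers which characters are legal at all, not escaping of `<`, `&` and so on, which the XML writer still has to handle.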