I get an error when running an XQuery on an XML file. Some elements in the XML contain Unicode characters along with the data:

      "+30 2222032000",
      "+30 6973222259\u001f"

I tried the replace and remove functions, but I do not know which Unicode characters can appear in my source file. Is there a generic method to remove all of these characters?

Thanks

Würgspaß
ankit
  • It is not clear which characters you consider Unicode characters and which you don't. XML/XQuery only works with Unicode characters, so everything in your string is a Unicode character. – Martin Honnen Feb 26 '19 at 09:22
  • Thanks Martin, I do not want sequences like "\u001f", the ones that start with '\u'. – ankit Feb 26 '19 at 09:43
  • @ankit That character is a unit separator, which is a control character. Unicode defines many control characters, so finding and removing them manually is tedious and error-prone. You should find out which encoding the application that produced the XML used and fix it there. – Würgspaß Feb 26 '19 at 09:54

1 Answer

Every character in an XML document is a Unicode character; if there were non-Unicode characters, you really would have problems.

Your actual problem is that the document uses an escape notation for a Unicode character, "\u001f", which XML parsers do not recognise. It is perfectly legal XML content, but it is treated as a sequence of six characters starting with a backslash, not as a representation of the control character x1F (which, as it happens, is not a character that XML 1.0 permits).
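
To illustrate the point about literal characters, here is a minimal sketch. It is not a complete solution. It assumes the only unwanted sequences are \u0000 through \u001f written out as text, and the variable name $raw is just for illustration. Note that the pattern also drops escaped tabs and line breaks such as \u0009 and \u000a, and it cannot tell a genuine escape from text that merely looks like one.

```xquery
xquery version "3.1";

(: Sample value from the question. XQuery string literals do not interpret
   backslashes, so \u001f here is six ordinary characters. :)
let $raw := '+30 6973222259\u001f'

return (
  (: Confirms that the escape is plain text, not one control character. :)
  string-length('\u001f'),                     (: 6 :)

  (: Assumption: only \u0000..\u001f escapes need removing. In the regex,
     \\ matches one literal backslash. :)
  replace($raw, '\\u00[01][0-9a-fA-F]', '')    (: '+30 6973222259' :)
)
```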

One way of dealing with these sequences would be to treat "+30 6973222259\u001f" as a JSON string and use the XQuery 3.1 function json-to-xml() to convert it to XML (the string needs its enclosing quotes). However, this will cause problems if there are escape sequences that convert to characters which XML doesn't permit, such as \u0000. The json-to-xml() function has options, such as escape and fallback, for dealing with such situations.
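
A rough sketch of the JSON route, under some assumptions: the processor applies XML 1.0 character rules (an XML 1.1 configuration may accept x1F and leave it in place), the value contains no unescaped double quotes or other text that is invalid inside a JSON string, and simply dropping the offending characters is acceptable. The variable names are illustrative only.

```xquery
xquery version "3.1";

(: Sample value from the question, with the escape as literal text. :)
let $raw := '+30 6973222259\u001f'

(: json-to-xml() expects a complete JSON text, so add the enclosing quotes. :)
let $json := '"' || $raw || '"'

(: The fallback function is called for escape sequences that denote
   characters XML does not permit; returning '' drops them.
   Assumption: discarding is what you want, not substituting a marker. :)
let $options := map {
  'fallback': function($escape as xs:string) as xs:string { '' }
}

(: The string value of the resulting document is the decoded text. :)
return string(json-to-xml($json, $options))
```

Without a fallback function, the specification has such characters replaced by the Unicode replacement character (xFFFD) instead.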

Michael Kay
  • Some control characters should not be included in XML; see https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0 – Giacomo Catenazzi Feb 26 '19 at 15:26