I have an application that receives XML from an API; the XML is saved in a column in a MS SQL database and is then used to render a report.

My problem is that MS SQL Server (2014 - 2022) seems to have problems with certain special characters in XML. To reproduce it, simply run this:

DECLARE @xml as XML
--SET @xml = '<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>'
SET @xml = '<?xml version="1.0" encoding="UTF-8"?><test>this works</test>'
select  @xml.value('(/test)[1]','nvarchar(50)') as Test
FROM @xml.nodes('//*[text()]') AS x(y)

The first SET doesn't work (illegal XML character); the second SET works.

The normal .NET XML classes have no problems with these characters, and neither does the API that sends them to me.

Currently I String.Replace these characters in .NET, but new ones keep appearing. Whenever a new invalid character shows up it breaks the process and I have to add it to the replace logic, which means recompiling the application.

I can't simply whitelist a-zA-Z, because I need to map É -> E, Ð -> D, and so on.

Even wrapping the content in CDATA doesn't help.
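A generic transliteration avoids hand-maintaining an ever-growing replace list. A minimal Python sketch of the idea: NFD decomposition splits É into E plus a combining accent, which can be dropped; the `to_ascii` name and its fallback table are hypothetical, and letters with no decomposition (such as Ð) still need explicit entries:

```python
import unicodedata

def to_ascii(text, extra=None):
    """Map accented characters to their base letter (É -> E, ó -> o).

    NFD decomposition splits 'É' into 'E' + a combining accent mark,
    which we drop. Letters that have no decomposition (e.g. Ð) are
    handled via an explicit fallback table.
    """
    extra = extra or {'Ð': 'D', 'ð': 'd', 'Þ': 'Th', 'þ': 'th'}
    text = ''.join(extra.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize('NFD', text)
    # Category 'Mn' = nonspacing combining marks (the accents).
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

print(to_ascii('É ó Ð'))  # E o D
```

The same approach exists in .NET via `String.Normalize(NormalizationForm.FormD)` plus a filter on `UnicodeCategory.NonSpacingMark`.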

halfer
Roli
  • Try: SET @xml = N'this should work É ó' – siggemannen Apr 21 '23 at 10:15
  • Sadly it doesn't work. The error message says I can't change the encoding (nvarchar -> xml). – Roli Apr 21 '23 at 10:21
  • 3
    you should remove the – siggemannen Apr 21 '23 at 10:24
  • Because your XML is UTF-8 you need to also ensure it's *stored* in UTF-8 in SQL Server. If you run the above in a database using a UTF-8 collation *both* work fine. UTF-8, however, is only supported in SQL Server 2019+. – Thom A Apr 21 '23 at 10:29
  • And the script you are executing should be encoded in utf-8 as well! – Aconcagua Apr 21 '23 at 10:30
  • @ThomA Would UTF-8 data stored within a non-UTF-8 MS SQL Server cause any problems at all? It should see it just as some 'strange' characters, as we humans do when reading UTF-8 in an editor that uses another encoding... The other way round might be more critical, as arbitrary upper-range characters (>127) likely produce invalid UTF-8 sequences. – Aconcagua Apr 21 '23 at 10:33
  • *"in a column in a MS SQL database"* Side note, databases don't have columns; columns are a property of a table. – Thom A Apr 21 '23 at 10:33
  • The problem is the encoding, @Aconcagua . If we take the OP's string, `'this doesnt work É ó'`, save it as a UTF-8 file, and then open it in ANSI-1252, we get `'this doesnt work Ã‰ Ã³'`, which is a problem, because some of those characters need escaping. – Thom A Apr 21 '23 at 10:36
  • @ThomA I'm aware of this interpretation – but which character would indeed need escaping? I'm not aware any of these having a special meaning? What do I miss? – Aconcagua Apr 21 '23 at 11:08
  • Does this answer your question? [SQL - UTF-8 to varchar/nvarchar Encoding issue](https://stackoverflow.com/questions/56177597/sql-utf-8-to-varchar-nvarchar-encoding-issue) – GSerg Apr 21 '23 at 12:55
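The mojibake described in the comments is easy to reproduce outside SQL Server: encode the text as UTF-8 and then misread those bytes as ANSI-1252 (Python shown purely for illustration):

```python
text = 'this doesnt work É ó'
# 'É' is the byte pair C3 89 in UTF-8; read as ANSI-1252 those two
# bytes display as 'Ã‰', and 'ó' (C3 B3) becomes 'Ã³'.
misread = text.encode('utf-8').decode('cp1252')
print(misread)  # this doesnt work Ã‰ Ã³
```

This is exactly what happens when a UTF-8 document is read through a Latin code page: every non-ASCII character turns into two different characters.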

1 Answer


The problem here is that you are using a UTF-8 value, but your database collation does not use UTF-8. As a result, when the value is read it is no longer valid, and you get an error. Presumably you are using a Latin collation, which means your varchar value is read using the ANSI-1252 code page; the UTF-8 bytes therefore can't be interpreted correctly, causing an error on characters like É and ó (which have different byte representations in ANSI-1252 and UTF-8).

The obvious answer is: if you need to work with UTF-8 values, work in a UTF-8 environment. That means creating a UTF-8 database, though, which could be a problem for an existing application; and since you are also using 2014-2017, it can be ruled out anyway, as those versions have no UTF-8 support (it was introduced in SQL Server 2019).

Another method is to remove the UTF-8 declaration; your value will then be read using the code page of the database:

DECLARE @xml as XML;
SET @xml = REPLACE('<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>',' encoding="UTF-8"','');
select @xml.value('(/test/text())[1]','nvarchar(30)') as Test;

This, however, will be an issue if you have any characters that appear outside of the ANSI-1252 code page and presumably you do, as why use UTF-8 otherwise?
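To gauge whether this approach is safe for your data, you can check (outside SQL Server) whether every character in a value actually fits in the code page; a quick illustration in Python:

```python
# 'É' and '€' have ANSI-1252 code points, so they survive the
# stripped-declaration approach; a character outside the code page,
# like '中', cannot be encoded at all.
ok = 'É €'.encode('cp1252')
print(ok)

try:
    '中'.encode('cp1252')
except UnicodeEncodeError as err:
    print('outside ANSI-1252:', err.reason)
```

Any value that trips the exception here would be mangled or rejected once the declaration is removed.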

Finally, another method is to replace the UTF-8 declaration with UTF-16. Note that your literal string must then be an nvarchar too, otherwise you'll get an error:

DECLARE @xml as XML;
SET @xml = REPLACE(N'<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>','"UTF-8"','"UTF-16"');
select @xml.value('(/test/text())[1]','nvarchar(30)') as Test;
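If the rewrite is easier to do in the application than in T-SQL, the same REPLACE can be applied before the value is sent to the database. A minimal sketch in Python (the `relabel_utf16` name is hypothetical; the regex just tolerates either quote style around the encoding name):

```python
import re

def relabel_utf16(xml_text):
    """Swap the declared encoding for UTF-16, matching ' or " quotes."""
    return re.sub(r'encoding=(["\'])UTF-8\1', 'encoding="UTF-16"',
                  xml_text, count=1)

doc = '<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>'
print(relabel_utf16(doc))
```

The equivalent one-liner in .NET is a single `Regex.Replace` on the response before it is stored.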
Thom A
  • Thank you for the answer. In hindsight it's so obvious. The UTF-8 is coming from the API, and, as you guessed, the database cannot be changed. Your REPLACE command is the solution! The XML is stored in a text column. I will experiment a bit to see whether removing the encoding or replacing it with UTF-16 is the better approach. – Roli Apr 21 '23 at 16:32
  • In a `text` column? That data type has been deprecated since **2005**; you need to stop using it. Use `xml` to store XML. – Thom A Apr 21 '23 at 17:15
  • My problem is that this text column is used to save different API responses. Some are XML, some JSON, some CSV, and some are even old EDI txt files. – Roli May 24 '23 at 12:18
  • Sounds like you might need multiple columns. `text` wasn't the correct choice in 2005, it's the completely wrong choice in 2023, @Roli . – Thom A May 24 '23 at 13:03
  • Because most processes use 1-2 stages, I have 2 columns for data that was sent to APIs and 2 columns for received data. I don't see the advantage of using 2+ columns for each of them, plus the necessary code changes. It's planned to change the columns to varchar(max), but currently some of the data can get into the megabyte range. – Roli May 25 '23 at 17:35