-1

I need to export a SQL Server table into a CSV file, one of the issues is that some of the values in a column could contains special characters, such as new line, which will produce an extra row in the CSV file. In order to fix the issue I have a function to use to remove special characters:

CREATE FUNCTION dbo.udf_tokenize(@input VARCHAR(MAX))
   RETURNS VARCHAR(MAX)
AS
BEGIN
   RETURN (SELECT CAST('<r><![CDATA[' + @input + ' ' + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)'));
END
GO

UPDATE dbo.prospect
SET sNotes = dbo.udf_tokenize(sNotes)

It works fine for most of the cases, except the following case: SNotes column value is:

2600/month gross income (CVS—pharmacy tech—

however, the len(sNotes) returns 169. and when I apply the function to this case, it returns error message:

Msg 9420, Level 16, State 1, Line 11 XML parsing: line 1, character 56, illegal xml character

Please advise how to handle this case. Thanks

Thank you all for your comments. I dig into this issue a bit further during the weekend, and find something new. The actual value inside the column is actually like this : Thank you all for your comments. I dig into this issue a bit further during the weekend. And I notice that the ACTUAL inside the column is `

"2600/month gross income (CVS—pharmacy tech— �1 yr and 5 mo. tenure) with $350 addl in child support. Occupants: myself and 7-yr old daughter. No evictions. No felonies.,"

For some reason, my SSMD only shows the

2600/month gross income (CVS—pharmacy tech—

and truncate the rest part, so the real question is how could I remove that � from the string. Thanks again

`

Yang L
  • 439
  • 3
  • 14
  • Debug it, break down you function into components and see at which step the error happens – Dale K Aug 11 '23 at 21:48
  • I would have just done a nested `REPLACE` to strip out `CHAR(10)` and `CHAR(13)` given [the initial requirement](https://stackoverflow.com/q/76806450/73226) anyway. I don't see any need to round trip it through XML – Martin Smith Aug 11 '23 at 21:55
  • 1
    The text you supplied is 43 characters, not 56 or 169, so we have no idea what data you are casting to xml. this is a particularly hacky method why not just use built-in functions such as Replace or Translate? – Stu Aug 11 '23 at 21:56
  • While asking a question, you need to provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example): (1) DDL and sample data population, i.e. CREATE table(s) plus INSERT T-SQL statements. (2) What you need to do, i.e. logic and your code attempt implementation of it in T-SQL. (3) Desired output, based on the sample data in the #1 above. (4) Your SQL Server version (SELECT @@version;). – Yitzhak Khabinsky Aug 11 '23 at 21:57
  • Presumably in reality your string contains a character like this https://dbfiddle.uk/CGDulI5L – Martin Smith Aug 11 '23 at 22:01
  • 1
    If you are on SQL Server 2017 or later, you can additionally add `TRANSLATE()` function. – Yitzhak Khabinsky Aug 11 '23 at 22:08
  • 1
    Are you sure things like th line break aren't needed in the export? Perhaps the real solution is to ensure your CSV file has quote identified values; then a line break won't mean a new row. – Thom A Aug 12 '23 at 09:18
  • 1
    Have you considered using Powershell and `Export-Csv`? https://stackoverflow.com/a/23976218/73226 – Martin Smith Aug 12 '23 at 09:43
  • I think you need to double back. A good CSV creator will quote columns correctly, so it won't matter if you have newlines etc in your data. – Charlieface Aug 13 '23 at 05:49

1 Answers1

-1

You just need to find out what illegal characters are causing the error.

Our assumption is that there is a NULL character somewhere in the middle.

If needed, you can extend invalid character list by adding them to the @invalidChars variable:

DECLARE @invalidChars VARCHAR(10) = CHAR(0x00) + CHAR(0x08) + ...;

SQL

/*
0. Invalid characters will be translated into spaces.
1. All invisible TAB, Carriage Return, and Line Feed characters will be replaced with spaces.
2. Then leading and trailing spaces are removed from the value. 
3. Further, contiguous occurrences of more than one space will be replaced with a single space.
*/
CREATE FUNCTION [dbo].[udf_tokenize](@input VARCHAR(MAX))
    RETURNS VARCHAR(MAX)
AS
BEGIN 
    DECLARE @invalidChars VARCHAR(10) = CHAR(0x00);
    RETURN (SELECT CAST('<r><![CDATA[' + TRANSLATE(@input, @invalidChars, REPLICATE(SPACE(1), LEN(@invalidChars))) + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)'));
END
GO


DECLARE @input VARCHAR(MAX) = '2600/month gross income (CVS—pharmacy tech—' + CHAR(08);

SELECT dbo.udf_tokenize(@input);
Yitzhak Khabinsky
  • 18,471
  • 2
  • 15
  • 21
  • 2
    What is the point of this? Rather than strip out the two characters they don't want they now need to strip out every possible character that is invalid in XML purely so they can use this XML hammer for something that isn't a nail? – Martin Smith Aug 12 '23 at 09:27
  • It is a follow up question: https://stackoverflow.com/questions/76806450/sqlcmd-single-row-product-multiple-lines – Yitzhak Khabinsky Aug 13 '23 at 01:51