14

We recently migrated from SQL Server 2012 to SQL Server 2014 and all our FOR XML code started throwing errors about non-printable ASCII characters. I wrote this horrible function to remove non-printable ASCII characters as a quick fix. I want to replace it with something cleaner. Is there a way to do this?

ALTER FUNCTION [dbo].[remove_non_printable_chars]
(@input_string nvarchar(max))
RETURNS nvarchar(max)
BEGIN
    RETURN
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(@input_string,
        CHAR(1), ''),CHAR(2), ''),CHAR(3), ''),CHAR(4), ''),CHAR(5), ''),CHAR(6), ''),CHAR(7), ''),CHAR(8), ''),CHAR(9), ''),CHAR(10), ''),
        CHAR(11), ''),CHAR(12), ''),CHAR(13), ''),CHAR(14), ''),CHAR(15), ''),CHAR(16), ''),CHAR(17), ''),CHAR(18), ''),CHAR(19), ''),CHAR(20), ''),
        CHAR(21), ''),CHAR(22), ''),CHAR(23), ''),CHAR(24), ''),CHAR(25), ''),CHAR(26), ''),CHAR(27), ''),CHAR(28), ''),CHAR(29), ''),CHAR(30), ''),
        CHAR(31), ''), NCHAR(0) COLLATE Latin1_General_100_BIN2, '')
END

Here's the FOR XML code that broke. (I did not write this. It was already in the code base).

SELECT @HTMLTableData =
(
    SELECT  HTMLRows 
    FROM (
        SELECT N'<tr>' + HTMLRow + N'</tr>' AS HTMLRows 
        FROM @HTMLRowData
    ) mi            
    FOR XML PATH(''), TYPE
).value('/', 'NVARCHAR(MAX)')
Munir
  • 3,442
  • 3
  • 19
  • 29
  • This is somewhat cumbersome but nested replace is what is going to have to happen to remove these specific characters. Nested replace is also super fast. You might consider turning this into an inline table valued function instead of a scalar function. FWIW, I don't this function is horrible. It is a lot of ugly code that is needed....kind of the point of functions in the first place. :) – Sean Lange Mar 31 '17 at 20:02
  • I'd post this on codereview.stackexchange.com – S3S Mar 31 '17 at 20:07
  • Web search yielded: https://www.sqlservercentral.com/Forums/Topic860321-338-1.aspx – xQbert Mar 31 '17 at 20:15
  • @SeanLange Thanks for the kind words :) – Munir Mar 31 '17 at 20:29
  • This sentence makes me thinking: *We recently migrated from SQL Server 2012 to SQL Server 2014 and all our FOR XML code started throwing errors about non-printable ASCII characters*. **Why and which errors?** How are you creating your XMLs? This should not happen... Read about the [XY-Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) It might help to show some example which worked in SS12 and does not work with SS14. I'm quite sure, that the problem - and its solution! - is somewhere else... – Shnugo Mar 31 '17 at 20:46
  • Btw: A while ago I answered a related question with a function to get namely foreign family names in plain latin. You [might have a look into this](http://stackoverflow.com/a/32048968/5089204) – Shnugo Mar 31 '17 at 20:48
  • @Shnugo The FOR XML is used to create emails after a job finishes. It uses data from tables and nothing else. The stored proc that sends out the emails just failed after we switched to 2014. We did not change anything else. – Munir Mar 31 '17 at 20:54
  • I still doubt this ... Please show some Code ... – Shnugo Apr 01 '17 at 13:25
  • @Shnugo Added the code that broke. There were multiple such blocks and all of them were giving errors. My suspicion is that encoding for `nvarchar` has changed between versions, but I have no way to validate that. I used the function to clean up the `HTMLRow` column in the innermost select. – Munir Apr 03 '17 at 17:32
  • @Munir This code is obviously creating an HTML table. You might [have a look at this generic solution](http://stackoverflow.com/a/39487565/5089204), which will make things easier. But back to your problem: My magic crystal ball tells me, that the problem is not connected to this, but to the filling of `@HtmlRowData`. I'm pretty sure, that with your last server there was a different default collation. Each database has its own default collation, but the server's temp db is depending on the server's default collation. This can lead into hardly findeably troubles... – Shnugo Apr 03 '17 at 18:56
  • @Munir ... continuing... You should **never ever** create XML (or HTML) on string level! This `SELECT N'' + HTMLRow + N'' AS HTMLRows` is a really bad approach! – Shnugo Apr 03 '17 at 18:57
  • @Shnugo It could be the collation issue you mentioned but I don't have access to the old server anymore to confirm. Also, this is not creating table, but just one row of the table. There's more such blocks later and then the final table is created as part of an email. I don't know why it has been done like this. There's many such issues in the entire code base. Like I mentioned, I am not the original developer, just the person who gets the fun task of making sure it runs as expected. – Munir Apr 03 '17 at 19:34
  • @Munir Just imagine, there is a `hi, Tom & Jerry` in your `@HtmlRowData`. This will be concatenated as `hi, Tom&Jerry`, which is invalid (`&` must be escaped to `&`)... Furthermore you should - for sure - not find the codes of non-printables like `CHAR(1)`... Your attempt with `REPLACE` might work, but is not more than an ugly repairment... The reason for all this is somewhere else... – Shnugo Apr 03 '17 at 19:39
  • @Shnugo I'm not disagreeing you with on any of this. I am sure the problem is somewhere else and I know what I have is an ugly hack (which was kind of the point of this question). Not sure what your point is here... – Munir Apr 03 '17 at 19:56
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/139803/discussion-between-shnugo-and-munir). – Shnugo Apr 03 '17 at 20:20

4 Answers4

13

Another Option.

This function will replace control characters and correct any residual repeating spaces. For example Jane Smith{13}was here will not be returned as Jane Smithwas here, but rather Jane Smith was here

CREATE FUNCTION [dbo].[udf-Str-Strip-Control](@S varchar(max))
Returns varchar(max)
Begin
    ;with  cte1(N) As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
           cte2(C) As (Select Top (32) Char(Row_Number() over (Order By (Select NULL))-1) From cte1 a,cte1 b)
    Select @S = Replace(@S,C,' ')
     From  cte2

    Return ltrim(rtrim(replace(replace(replace(@S,' ','†‡'),'‡†',''),'†‡',' ')))
End
--Select [dbo].[udf-Str-Strip-Control]('Michael        '+char(13)+char(10)+'LastName')  --Returns: Michael LastName
John Cappelletti
  • 79,615
  • 7
  • 44
  • 66
  • 1
    Thanks. I'll try this out tomorrow. I'm not sure I want to add spaces though since the control characters might be within a word. – Munir Mar 31 '17 at 20:32
  • 1
    @Munir fair enough, then just change Select @S = Replace(@S,C,' ') to Select @S = Replace(@S,C,'') however, you run the risk of concatinating words – John Cappelletti Mar 31 '17 at 20:34
  • 1
    @scsimon Can you blame me? When I first saw it, my first thought was "Meh..." then it it dawned on me... F'n brilliant ! – John Cappelletti Mar 31 '17 at 20:45
  • 1
    Shoot no. I still remember that answer and was blown away, hence my remembering it now! – S3S Mar 31 '17 at 20:46
  • 1
    So, if I understood this correctly, `cte1(N)` creates a table of 10 1's, cte2 converts them into the char(0) to char(100) and picks the top 32, the select replaces all the control chars and the expression in the return statement converts all multiple spaces into a single space. Since I want to preserve the input as is, I returned `@S` without any changes. This is also very slightly faster than the nested replaces for me, so I'll use this. Thank you! – Munir Apr 03 '17 at 17:53
  • 1
    @Munir SqlZim correctly pointed out that TVF are usually the best way to go, however, in this case there were just too many replace commands... diminishing returns ... – John Cappelletti Apr 03 '17 at 18:09
  • 1
    @ScottShaw-Smith While I respect the fact that you "own" the downvote, it does indeed work. OP accepted answer and and others upvoted. I don't believe it is worthy of a downvote – John Cappelletti Sep 01 '17 at 17:11
  • 1
    @ScottShaw-Smith I see that you downvoted SQLZim's as well. Just curious... what did we miss? – John Cappelletti Sep 01 '17 at 21:52
  • This is great! I used this code to create a function in our DB :) – Sabatino Ognibene Apr 13 '22 at 21:51
  • @JohnCappelletti why `'†‡'` and `'‡†'` are included? – OfirD Dec 18 '22 at 10:21
  • @OfirD Just a unique / rare combination of characters to remove any number of repeating spaces – John Cappelletti Dec 18 '22 at 16:20
9

In-line version:

create function [dbo].[remove_non_printable_chars] (@input_string nvarchar(max))
returns table with schemabinding as return (
  select 
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(@input_string collate latin1_general_100_bin2,
        char(1), ''),char(2), ''),char(3), ''),char(4), ''),char(5), ''),char(6), ''),char(7), ''),char(8), ''),char(9), ''),char(10), ''),
        char(11), ''),char(12), ''),char(13), ''),char(14), ''),char(15), ''),char(16), ''),char(17), ''),char(18), ''),char(19), ''),char(20), ''),
        char(21), ''),char(22), ''),char(23), ''),char(24), ''),char(25), ''),char(26), ''),char(27), ''),char(28), ''),char(29), ''),char(30), ''),
        char(31), ''), char(0) , '') 
     as clean_string
);
go

And use it like so:

select c.clean_string
from dbo.remove_non_printable_chars(@dirtystring) c

or

select ...
  , c.clean_string
from t
  cross apply dbo.remove_non_printable_chars(t.dirty_string) c

Reference:

Community
  • 1
  • 1
SqlZim
  • 37,248
  • 6
  • 41
  • 59
  • Thanks...I'll wait for some other answers. If not, I'll just switch to a table valued function. – Munir Mar 31 '17 at 20:31
  • Looks like you got downvoted by the same guy that downvoted me. I wonder what he knows that we missed. :) – John Cappelletti Sep 01 '17 at 21:49
  • 2
    @JohnCappelletti I guess we'll never know – SqlZim Sep 01 '17 at 22:05
  • Can you elaborate on the use of `collate latin1_general_100_bin2`? At first, I tried to omit it because my database has a different collation, but I noticed that for a query with ~3000 rows, the function is much faster *with* `collate latin1_general_100_bin2` (2 seconds vs 30 seconds)...and I don't understand why. – Christian Specht Jul 04 '18 at 08:52
  • @ChristianSpecht I'm concerned about the query taking 30 seconds to run over 3,000 rows even without the collation switch. How big are these strings you're using? I wouldn't have expected that big of a difference. [Guidelines for Using BIN and BIN2 Collations](https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms143350(v=sql.105)) and [Paul White's answer about binary collation performance](https://dba.stackexchange.com/a/110935/43889) – SqlZim Jul 04 '18 at 16:15
2

Here is a similar answer of the previous inline-table valued function answer (https://stackoverflow.com/a/43148897/2864740) for this question. The primary change is this uses a patindex guard first and is much faster when only a small percentage of the rows contain characters that need to be replaced.

Thus the two huge improvements from the original scalar function:

  • Use an inline table-valued function. This is much faster as it allows SQL Server to directly in-line the code in the query plan. I try to avoid scalar functions in queries designed to scale, as a normal scalar function can be a huge performance sap (even with schemabinding) and prevent optimizations such as parallelism.

  • Use patindex for an initial guard check. This changes the number of characters SQL must exmine when there are no control characters to replace from O(n * num_replace_calls) to ~O(n). Since most of the data data (in my case) contains no control characters, this can result in a huge performance increase.

-- Only accepts VARCHAR(8000) to avoid a conversion to VARCHAR(MAX);
-- use the suitable input type, which might even be NVARCHAR(MAX).
CREATE FUNCTION DropControlCharactersTv(@str VARCHAR(8000))
RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT CleanedString = CASE
    -- No-op.
    WHEN @str IS NULL or @str = '' THEN @str
    -- If any of the non-wanted characters are found then go through the string and replace each occurrence of every character.
    -- This guard significantly improves the performance when very few strings need to be corrected.
    WHEN PATINDEX (
        '%[' + CHAR(0) + CHAR(1) + CHAR(2) + CHAR(3) + CHAR(4) + CHAR(5) + CHAR(6) + CHAR(7) + CHAR(8) + CHAR(11) + CHAR(12) + CHAR(14) + CHAR(15) + CHAR(16) + CHAR(17) + CHAR(18) + CHAR(19) + CHAR(20) + CHAR(21) + CHAR(22) + CHAR(23) + CHAR(24) + CHAR(25) + CHAR(26) + CHAR(27) + CHAR(28) + CHAR(29) + CHAR(30) + CHAR(31) + CHAR(127) + ']%',
        @str COLLATE Latin1_General_BIN) <> 0 THEN
        -- Replace, nested.
        -- See https://www.sqlservercentral.com/forums/topic/how-to-remove-characters-char0-to-char31
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        @str
        ,CHAR(0),'') COLLATE Latin1_General_BIN
        ,CHAR(1),'') COLLATE Latin1_General_BIN
        ,CHAR(2),'') COLLATE Latin1_General_BIN
        ,CHAR(3),'') COLLATE Latin1_General_BIN
        ,CHAR(4),'') COLLATE Latin1_General_BIN
        ,CHAR(5),'') COLLATE Latin1_General_BIN
        ,CHAR(6),'') COLLATE Latin1_General_BIN
        ,CHAR(7),'') COLLATE Latin1_General_BIN
        ,CHAR(8),'') COLLATE Latin1_General_BIN
        ,CHAR(9),'') COLLATE Latin1_General_BIN
        ,CHAR(10),'') COLLATE Latin1_General_BIN
        ,CHAR(11),'') COLLATE Latin1_General_BIN
        ,CHAR(12),'') COLLATE Latin1_General_BIN
        ,CHAR(13),'') COLLATE Latin1_General_BIN
        ,CHAR(14),'') COLLATE Latin1_General_BIN
        ,CHAR(15),'') COLLATE Latin1_General_BIN
        ,CHAR(16),'') COLLATE Latin1_General_BIN
        ,CHAR(17),'') COLLATE Latin1_General_BIN
        ,CHAR(18),'') COLLATE Latin1_General_BIN
        ,CHAR(19),'') COLLATE Latin1_General_BIN
        ,CHAR(20),'') COLLATE Latin1_General_BIN
        ,CHAR(21),'') COLLATE Latin1_General_BIN
        ,CHAR(22),'') COLLATE Latin1_General_BIN
        ,CHAR(23),'') COLLATE Latin1_General_BIN
        ,CHAR(24),'') COLLATE Latin1_General_BIN
        ,CHAR(25),'') COLLATE Latin1_General_BIN
        ,CHAR(26),'') COLLATE Latin1_General_BIN
        ,CHAR(27),'') COLLATE Latin1_General_BIN
        ,CHAR(28),'') COLLATE Latin1_General_BIN
        ,CHAR(29),'') COLLATE Latin1_General_BIN
        ,CHAR(30),'') COLLATE Latin1_General_BIN
        ,CHAR(31),'') COLLATE Latin1_General_BIN
        ,CHAR(127),'') COLLATE Latin1_General_BIN
    -- Did not match pattern: inherently valid
    ELSE @str END

In a query:

select
    Plucker.CleanedString
from Goose d
cross apply DropControlCharactersTv(d.Turkey) as Plucker
user2864740
  • 60,010
  • 15
  • 145
  • 220
1

Just extending the previous answers a little bit

Below the white-list characters, all the others chars will be cleared

[ !`"#$%&'()\*+,\-\./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\]^``\\_abcdefghijklmnopqrstuvwxyz{|}~µº°¡¢£¤¥¦§¨©ª«¬­®¯±²³´¶·¸¹»¼½¾¿×÷ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]

I know it's ugly but it works.

CREATE FUNCTION [dbo].REPLACE_UNPRINT_CHARS(@VarString nvarchar(256))  
RETURNS nvarchar(256) 
AS    
BEGIN  
    RETURN REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@VarString, CHAR(0), ''), CHAR(1), ''), CHAR(2), ''), CHAR(3), ''), CHAR(4), ''), CHAR(5), ''), CHAR(6), ''), CHAR(7), ''), CHAR(8), ''), CHAR(9), ''), CHAR(10), ''), CHAR(11), ''), CHAR(12), ''), CHAR(13), ''), CHAR(14), ''), CHAR(15), ''), CHAR(16), ''), CHAR(17), ''), CHAR(18), ''), CHAR(19), ''), CHAR(20), ''), CHAR(21), ''), CHAR(22), ''), CHAR(23), ''), CHAR(24), ''), CHAR(25), ''), CHAR(26), ''), CHAR(27), ''), CHAR(28), ''), CHAR(29), ''), CHAR(30), ''), CHAR(31), ''), CHAR(127), ''), CHAR(128), ''), CHAR(129), ''), CHAR(130), ''), CHAR(131), ''), CHAR(132), ''), CHAR(133), ''), CHAR(134), ''), CHAR(135), ''), CHAR(136), ''), CHAR(137), ''), CHAR(138), ''), CHAR(139), ''), CHAR(140), ''), CHAR(141), ''), CHAR(142), ''), CHAR(143), ''), CHAR(144), ''), CHAR(145), ''), CHAR(146), ''), CHAR(147), ''), CHAR(148), ''), CHAR(149), ''), CHAR(150), ''), CHAR(151), ''), CHAR(152), ''), CHAR(153), ''), CHAR(154), ''), CHAR(155), ''), CHAR(156), ''), CHAR(157), ''), CHAR(158), ''), CHAR(159), ''), CHAR(160), ''); 
END;

Use for data clean up

UPDATE [dnName].[dbo].[tableName] 
SET FieldDirtyData= dbo.REPLACE_UNPRINT_CHARS(FieldDirtyData)
WHERE PATINDEX('%['+CHAR(1)+CHAR(2)+CHAR(3)+CHAR(4)+CHAR(5)+CHAR(6)+CHAR(7)+CHAR(8)+CHAR(9)+CHAR(10)+CHAR(11)+CHAR(12)+
CHAR(13)+CHAR(14)+CHAR(15)+CHAR(16)+CHAR(17)+CHAR(18)+CHAR(19)+CHAR(20)+
CHAR(21)+CHAR(22)+CHAR(23)+CHAR(24)+CHAR(25)+CHAR(26)+CHAR(27)+CHAR(28)+CHAR(29)+CHAR(30)+CHAR(31)+CHAR(127)+
CHAR(128)+CHAR(129)+CHAR(130)+CHAR(131)+CHAR(132)+CHAR(133)+CHAR(134)+CHAR(135)+CHAR(136)+CHAR(137)+CHAR(138)+
CHAR(139)+CHAR(140)+CHAR(141)+CHAR(142)+CHAR(143)+CHAR(144)+CHAR(145)+CHAR(146)+CHAR(147)+CHAR(148)+CHAR(149)+CHAR(150)+
CHAR(151)+CHAR(152)+CHAR(153)+CHAR(154)+CHAR(155)+CHAR(156)+CHAR(157)+CHAR(158)+CHAR(159)+CHAR(160)+']%', FieldDirtyData) <> 0

Adjust your datatype (nvarchar or varchar + max) as required

If you want to add more chars to clear use "select ASCII('char to remove here')" MSSQL command in order to get the ASCII code of the char and put it inside the replace instruction

i.g SELECT ASCII('¢') returns 162

so add one more "REPLACE(" after "RETURN" and "CHAR(162), '')" at the end of line but before the ";" sign.

  • Now let's just hope that only users of Latin alphabet-based languages ever use the application. – Rab Mar 24 '21 at 18:40