1

This has been a major headache for the last couple of weeks. I have a largish table (165 columns x 11000+ rows). In this table, there are several comment columns that are set to varchar(max). There is one in particular that keeps getting invalid characters pasted into it by various users. This causes the reports in SSRS to fail. I then have to go and find these invalid characters and remove them. This has been a very painstaking time-consuming task.

What I would like to do is find a way to search automatically for these invalid characters and replace them with nothing. The problem is that I have no idea how to go about a search directly for these characters. Here is what they look like:

Crap in the SQL

and here's another image of the same:

Crap in the SQL

And here is what it looks like when I paste it into Notepad++:

Crap in SQL

I am not sure if it'll work and show up the way I see it, but here are the characters:

㹊潮Ņ᯸ࢹᖈư㹨ƶ槹鎤⻄ƺ綐ڌ⸀ƺ삸)䀤ƍ샄)Ņᛡ鎤ꗘᖃᒨ쬵Ğᘍ鎤ᐜᏰ>֔υ赸Ƹ쳰డ촜)鉀௿촜)쮜)Ἡ屰山舰霡ࣆ 耏Аం畠Ư놐ᓜતᏛ֔Ꮫ֨Ꮫ꯼ᓜƒ 邰఍厰ఆ邰఍드)抉鎤듄)繟Ĺ띨)᯸ࢹ䮸ࣉ᯸ࢹ䮸ࣉ샰)ԌƏŅ֐ᕄ홑Ņᛙ鎤ꗘᖃᒨ᯸ࢹ

They look like they are Chinese or something similar, but I've tried using Google Translate and it detects them as English.

Any help in figuring out a way to search for these? Create a Function or an SP would be fine as long as it works!

UPDATE

I've tried part of a solution I found here: How can I find Unicode/non-ASCII characters in an NTEXT field in a SQL Server 2005 table? and used this:

-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)

-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
    begin
    -- Uses | to escape, could be any character
    set @str = @str + '|' + char(@i)
    set @i = @i + 1
    end

select MEETING_NOTES from pmdb.TrackerData
where MEETING_NOTES like '%[^' + @str + ']%' escape '|'

But it is returning a lot more rows than it should be. I currently only have 1 row that has these invalid characters and it is returning 1708.

UPDATE 2

I have created a Function to try and remove all the invalid characters like this:

ALTER FUNCTION [dbo].[RemoveNonPrintable]
(
    @inputtext nvarchar(max) 
)
RETURNS nvarchar(max)
AS
BEGIN

    DECLARE @counter int = 1;
    DECLARE @colString nvarchar(1000)
    set @inputtext = REPLACE(@inputtext, char(0), '') -- 'NULL'
    set @inputtext = REPLACE(@inputtext, char(1), '') -- 'Start of Heading'
    set @inputtext = REPLACE(@inputtext, char(2), '') -- 'Start of Text'
    set @inputtext = REPLACE(@inputtext, char(3), '') -- 'End of Text'
    set @inputtext = REPLACE(@inputtext, char(4), '') -- 'End of Transmission'
    set @inputtext = REPLACE(@inputtext, char(5), '') -- 'Enquiry'
    set @inputtext = REPLACE(@inputtext, char(6), '') -- 'Acknowledgement'
    set @inputtext = REPLACE(@inputtext, char(7), '') -- 'Bell'
    set @inputtext = REPLACE(@inputtext, char(8), '') -- 'Backspace'
    set @inputtext = REPLACE(@inputtext, char(9), '') -- 'Horizontal Tab'
    -- replace line feed with blank, so words that were in different lines before are still separated
    set @inputtext = REPLACE(@inputtext, char(10), ' ') -- 'Line Feed'
    set @inputtext = REPLACE(@inputtext, char(11), '') -- 'Vertical Tab'
    set @inputtext = REPLACE(@inputtext, char(12), '') -- 'Form Feed'        
    -- replace carriage return with blank, so words that were in different lines before are still separated
    set @inputtext = REPLACE(@inputtext, char(13), ' ') -- 'Carriage Return'
    set @inputtext = REPLACE(@inputtext, char(14), '') -- 'Shift Out'
    set @inputtext = REPLACE(@inputtext, char(15), '') -- 'Shift In'
    set @inputtext = REPLACE(@inputtext, char(16), '') -- 'Data Link Escape'
    set @inputtext = REPLACE(@inputtext, char(17), '') -- 'Device Control 1'
    set @inputtext = REPLACE(@inputtext, char(18), '') -- 'Device Control 2'
    set @inputtext = REPLACE(@inputtext, char(19), '') -- 'Device Control 3'
    set @inputtext = REPLACE(@inputtext, char(20), '') -- 'Device Control 4'
    set @inputtext = REPLACE(@inputtext, char(21), '') -- 'Negative Acknowledgment'
    set @inputtext = REPLACE(@inputtext, char(22), '') -- 'Synchronous Idle'
    set @inputtext = REPLACE(@inputtext, char(23), '') -- 'End of Transmission Block'
    set @inputtext = REPLACE(@inputtext, char(24), '') -- 'Cancel'
    set @inputtext = REPLACE(@inputtext, char(25), '') -- 'End of Medium'
    set @inputtext = REPLACE(@inputtext, char(26), '') -- 'Substitute'
    set @inputtext = REPLACE(@inputtext, char(27), '') -- 'Escape'
    set @inputtext = REPLACE(@inputtext, char(28), '') -- 'File Separator'
    set @inputtext = REPLACE(@inputtext, char(29), '') -- 'Group Separator'
    set @inputtext = REPLACE(@inputtext, char(30), '') -- 'Record Separator'
    set @inputtext = REPLACE(@inputtext, char(31), '') -- 'Unit Separator'
    set @inputtext = REPLACE(@inputtext, char(127), '') -- 'Delete'

    set @colString = @inputtext

    WHILE @counter <= DATALENGTH(@colString)
       BEGIN
           set @colString   = REPLACE(@colString,isnull(NCHAR(UNICODE(SUBSTRING(@colString, @counter, 1))),'|'),'|')
           set @colString = REPLACE(@colString,'|','')

           SET @counter = @counter + 1
       END
    return @inputtext       
END

I call it like this:

BEGIN TRAN  --COMMIT    ROLLBACK
update pmdb.TrackerData
set CIRCUIT_COMMENTS = [dbo].[RemoveNonPrintable](CIRCUIT_COMMENTS),
    COE_COMMENTS = [dbo].[RemoveNonPrintable](COE_COMMENTS),
    MEETING_NOTES = [dbo].[RemoveNonPrintable](MEETING_NOTES),
    OSP_COMMENTS = [dbo].[RemoveNonPrintable](OSP_COMMENTS),
    COE_COMMENTS2 = [dbo].[RemoveNonPrintable](COE_COMMENTS2)

Then I run the code from the previous update to see if there is any difference. There is no difference. What gives? Am I doing this wrong?

EDIT 3

I've updated my function to have this:

    set @colString = @inputtext

    WHILE @counter <= DATALENGTH(@colString)
       BEGIN
           --set @colString = REPLACE(@colString,isnull(NCHAR(UNICODE(SUBSTRING(@colString, @counter, 1))),'|'),'|')
           --set @colString = REPLACE(@colString,'|','')
           if (UNICODE(SUBSTRING(@colString, @counter,1)) > 126)
           BEGIN
            SET @colString = REPLACE(@colString, CONVERT(nvarchar(1),(SUBSTRING(@colString, @counter,1))), CHAR(32))
           END
           ELSE IF(UNICODE(SUBSTRING(@colString, @counter, 1)) < 32)
           BEGIN
            SET @colString = REPLACE(@colString, CONVERT(nvarchar(1),(SUBSTRING(@colString, @counter,1))), CHAR(32))
           END

           set @inputtext = @colString

           SET @counter = @counter + 1
       END

It removes most of the invalid characters, but then it leaves others. I call it on a temp table that I created that holds the sample of invalid characters shown above like this:

update #Temp
set Notes = [dbo].[RemoveNonPrintable](Notes),
    Notes2 = [dbo].[RemoveNonPrintable](Notes2)

Then I'm left with the following in the two Notes:

Notes: ????N???u?z?????????)???)?N??????G????>???????)???)?)???????? ????U??????????  ???????)???)?L?)?????????)?????N???N???????

Notes2:     ᯸ࢹᖈ 㹨   ⻄   ⸀  )䀤  )  ᛡ ꗘᖃᒨ  ᘍ ᐜᏰ>֔      ) ௿ ) )Ἡ      ࣆ       ᓜ Ꮫ֔Ꮫ֨Ꮫ꯼ᓜ   ఍   ఍ )   )   )᯸ࢹ䮸ࣉ᯸ࢹ䮸ࣉ )Ԍ   ֐ᕄ  ᛙ ꗘᖃᒨ᯸ࢹ

Which is better than what I started with, but still not good enough.

Mike
  • 1,853
  • 3
  • 45
  • 75
  • Are the invalid characters only in one column of the table? Or is there no way of telling where the invalid characters are? – Crazy Cucumber Jun 28 '17 at 14:12
  • @CrazyCucumber There are 5 columns that they could show up in. So far I know that they've shown up in 2 of them. The repeat offender is called Meeting_Notes. – Mike Jun 28 '17 at 14:13
  • Is there a unique identifier for each row? – Crazy Cucumber Jun 28 '17 at 14:14
  • 1
    You could check the UNICODE values but it would not be fast as it would have to look at each character. Your columns are varchar and not nvarchar? Have you tried changing the column to nvarchar? – Sean Lange Jun 28 '17 at 14:15
  • Won't the 11,000+ rows get in the way of changing the data type of the column? I think you'd get an error because of truncation/data modification when you try to change the column name. – Crazy Cucumber Jun 28 '17 at 14:17
  • Have you thought about exporting the table to an Excel file, find and replace all the characters with blanks, import the updated Excel file to SQL, and updating the original table with the imported table's values? – Crazy Cucumber Jun 28 '17 at 14:18
  • 7
    Anytime someone says take something out of a DB and put it into Excel or Access and then back into a DB a kitten dies – S3S Jun 28 '17 at 14:19
  • @scsimon Haha I figured someone was going to say that. forget that I even suggested that :) – Crazy Cucumber Jun 28 '17 at 14:20
  • @SeanLange the Meeting_Notes column is `nvarchar` some of the others are only `varchar`. It doesn't make any difference. – Mike Jun 28 '17 at 14:20
  • @CrazyCucumber Doing an Export/Import will not work as this is live data that people are working with throughout the day. I can modify a line here and there or all of it as it's pulled for reports, but I can't do it that way. I was hoping that there was a way to search for these specific types of characters and replace or remove them. – Mike Jun 28 '17 at 14:22
  • @SeanLange How would I check the UNICODE values? – Mike Jun 28 '17 at 14:23
  • For searching, how about you flip over the search criteria where you search for valid characters which would be easier to write. Then you could look to apply NOT EXISTS or something similar over the top, to show those erroneous rows? just a thought. – kevchadders Jun 28 '17 at 14:38
  • To see the UNICODE value of a character you use UNICODE(YourCharacterHere) – Sean Lange Jun 28 '17 at 14:39
  • @CrazyCucumber not sure what you mean about 11,000 rows preventing the datatype from being changed. That isn't even a big table so changing the datatype should be pretty fast. No idea what you mean about truncation error or changing the column name. – Sean Lange Jun 28 '17 at 14:40

1 Answers1

0

I have found a solution in another users question here

I modified it slightly though. What ends up working for me is this:

ALTER FUNCTION [dbo].[RemoveNonASCII] 
(
    -- Parameters
    @nstring nvarchar(max)
)
RETURNS varchar(max)
AS
BEGIN
    -- Variables
    DECLARE @Result varchar(max) = '',@nchar nvarchar(1), @position int

    -- T-SQL statements to compute the return value
    set @position = 1
    while @position <= LEN(@nstring)
    BEGIN
        set @nchar = SUBSTRING(@nstring, @position, 1)
        if UNICODE(@nchar) between 32 and 127
            set @Result = @Result + @nchar
        set @position = @position + 1
        set @Result = REPLACE(@Result,'))','')
        set @Result = REPLACE(@Result,'?','')
    END

    -- Return the result
    RETURN @Result

END
Mike
  • 1,853
  • 3
  • 45
  • 75