1

There is a weird whitespace character I can't seem to get rid of that occasionally shows up in my data when importing from Excel. Visibly, it comes across as a whitespace character BUT SQL Server sees it as a question mark (ASCII 63).

declare @temp nvarchar(255); set @temp = 'carolg@c?am.com'
select @temp

returns:

?carolg@c?am.com

How can I get rid of the whitespace without getting rid of real question marks? If I look at the ASCII code for each of those "?" characters I get 63 when in fact, only one of them is a real qustion mark.

wgpubs
  • 8,131
  • 15
  • 62
  • 109
  • what if you do an Ltrim(@temp)? – Harry Oct 17 '17 at 00:09
  • 2
    Note: If, for some reason, you wanted to keep the string as is, you would do `set @temp = N'mystring'` (explicitly declaring the string is unicode, which you should do even if your variable is `NVARCHAR`) rather than what you're currently doing. – ZLK Oct 17 '17 at 00:18
  • Look s like a character set mismatch. – Namphibian Oct 17 '17 at 00:55
  • Your question does not appear to contain the character you are discussing. Please add it. @ZLK gives the answer for your demo code but I take it that in Excel you aren't using that code. So, what's your real code, too? – Tom Blodget Oct 17 '17 at 13:57
  • @TomBlodget ... when I copy/paste it doesn't show up. The character turns out to be "0x0B20" (unicode for zero-width space). The answer below resolved it. – wgpubs Oct 17 '17 at 21:24

1 Answers1

2

Have a look at this answer for someone with a similar issue. Sorry if this is a bit long winded:

SQL Server seems to flatten Unicode to ASCII by mapping unrepresentable characters (for which there is no suitable substitution) to a question mark. To replicate this, try opening the Character Map Windows program (should be installed on most machines), select Arial as the font and find U+034f "Combining Grapheme Joiner". select this character, copy to clipboard and paste it between the single quotes below:

declare @t nvarchar(10)
set @t = '͏'
select rtrim(ltrim(@t)) -- we can try and trim it, but by this stage it's already a '?'

You'll get a question mark out, because it doesn't know how to represent this non-ASCII character when it casts it to varchar. To force it to accept it as a double-byte character (nvarchar) you need to use N'' instead, as has already been mentioned. Add an N before the quotes above and the question mark disappears (but the original invisible character is preserved in the output - and ltrim and rtrim won't remove it as demonstrated below):

declare @t nvarchar(10), 
        @s varchar(10) -- note: single-byte string
set @t = rtrim(ltrim(N'͏')) -- trimming doesn't work here either
set @s = @t 
select @s -- still outputs a question mark

Imported data can definitely do this, I've seen it before, and characters like the one I've shown above are particularly hard to diagnose because you can't see them! You will need to create some sort of scrubbing process to remove these unprintables (and any other junk characters, for that matter), and make sure that you use nvarchar everywhere, or you'll end up with this issue. Worse, those phantom question marks will become real question marks that you won't be able to distinguish from legitimate ones.

To see what character code you're dealing with, you can cast as varbinary as follows:

declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- returns 0x4F0374006500730074003F00

-- Returns:
-- 0x4F03  7400 6500 7300 7400 3F00
-- badchar  t    e    s    t    ?

Now to get rid of it:

declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- bad char
set @t = replace(@t COLLATE Latin1_General_100_BIN2, nchar(0x034f), N'');
select cast(@t as varbinary)  -- gone!

Note I had to swap the byte order from 0x4f03 to 0x034f (same reason "t" appears in the output as 0x7400, not 0x0074). For some notes on why we're using binary collation, see this answer.

This is kind of messy, because you don't know what the dirty characters are, and they could be one of thousands of possibilities. One option is to iterate over strings using like or even the unicode() function and discard characters in strings that aren't in a list of acceptable characters, but this could be slow. It may be that most of your bad characters are either at the start or end of the string, which might speed this process up if that's an assumption you think you can make.

You may need to build additional processes either external to SQL Server or as part of a SSIS import based on what I've shown you above to strip this out quickly if you have a lot of data to import. If you aren't sure the best way to do this, that's probably best answered in a new question.

I hope that helps.

pcdev
  • 2,852
  • 2
  • 23
  • 39
  • great explanation! literally was going insane. In my case the character was "0x0B20" ... unicode for a zero-width space. WTF that means, I have no idea. – wgpubs Oct 17 '17 at 21:23