How to handle weird unicode characters

Question

NSTR 2009-A – Underlying got a $1.3MM ($91.3MM remains). C/E rose to 67.1%

Below is the image of the above text in Notepad++ with Encode in UTF-8 turned on. The 'x96' is a dash and the 'xA0' are spaces. SQL Server gives Invalid Character error. How do I get rid of these @#$#? It's causing me a huge headache trying to fix ... ;-x

enter image description here

I tried below. It kept the dash but changed the 'xA0' to question marks:

byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(notesXML);
string notesXML = System.Text.Encoding.UTF8.GetString(tempBytes);

Tips appreciated, thanks !

What about the replace function? Should be able to use the unicode number to find the offending chars — Grantly, Oct 30 '14 at 00:54
@Mark, was hoping to handle this in c# in the webAPI before it gets to the DB — nanonerd, Oct 30 '14 at 01:10
Your C# code is almost certainly wrong. Start from the beginning: what are your input data, text or bytes, and if bytes, then what bytes exactly. — Karol S, Oct 30 '14 at 10:56
You say you have "Encode in UTF-8 turned on". The problem is that those bytes are not UTF-8 codes; byte values in the range 0x80..0xBF can only appear in UTF-8 after a byte in the range 0xC2..0xF4. So, the data is not in UTF-8; it is in some other, non-Unicode code page, where 0xA0 is a non-breaking space, and 0x96 is a long dash. Ultimately, you can't simply treat arbitrary data as UTF-8 encoded data, and you need to convert from the current code set to UTF-8. What I don't know is how best to help you convert it. I don't use Windows enough to know what the tools are. — Jonathan Leffler, Oct 30 '14 at 12:55

score 2 · Accepted Answer · answered Oct 31 '14 at 21:57

2

It looks like the encoding of your original text could be Windows 1252:

96 = U+2013 : EN DASH
A0 = U+00A0 : NO-BREAK SPACE

So using System.Text.Encoding.GetEncoding("Windows-1252"), you should be able to read your original text without corrupting it, and you can then convert it to whatever encoding is being used by your database (eg: utf-8).

answered Oct 31 '14 at 21:57

TachyonVortex

8,242
3
48
63

He drinks Blue Moon beer if his avatar is a hint, he doesn't get out much. Encoding.Default is the likely solution. – Hans Passant Oct 31 '14 at 22:14
Hans, you are European, if your name is a hint. Just because an American company is able to brew a Belgian style beer just as good or better than Euro beer, please don't be jealous my friend. – nanonerd Oct 31 '14 at 23:04
Tachy - I ended up doing a direct string Replace for U2013 and U2014 (to cover both en and em dash) and did the same for two different types of spaces. But conceptually, I think you're saying covert entire string from windows-1252 and convert it to utf-8? If so, that would be better solution to convert all other possible funky chars. I looked around but am not sure how to code this. If this is the correct thinking, could you provide a quick code sameple? Thank you. – nanonerd Oct 31 '14 at 23:09
@nanonerd Yes exactly, it'd be good to ensure that your solution can handle all possible characters. It looks like the [`Encoding.Convert`](http://msdn.microsoft.com/en-us/library/kdcak6ye(v=vs.110).aspx) method is probably what you want. There's a code sample [here](http://stackoverflow.com/a/1922253) which you could adapt for your needs. Also, you'll need to confirm what encoding is being used by your database (it may be `utf-8`, but it may not be.) – TachyonVortex Oct 31 '14 at 23:59

How to handle weird unicode characters

1 Answers1