1

NSTR 2009-A – Underlying got a $1.3MM ($91.3MM remains).  C/E rose to 67.1%

Below is the image of the above text in Notepad++ with Encode in UTF-8 turned on. The 'x96' is a dash and the 'xA0' are spaces. SQL Server gives Invalid Character error. How do I get rid of these @#$#? It's causing me a huge headache trying to fix ... ;-x

enter image description here

I tried below. It kept the dash but changed the 'xA0' to question marks:

byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(notesXML);
string notesXML = System.Text.Encoding.UTF8.GetString(tempBytes);

Tips appreciated, thanks !

nanonerd
  • 1,964
  • 6
  • 23
  • 49
  • 1
    What about the replace function? Should be able to use the unicode number to find the offending chars – Grantly Oct 30 '14 at 00:54
  • 1
    Can't you just configure the DB to accept UTF-8? – Mark Ransom Oct 30 '14 at 01:07
  • @Mark, was hoping to handle this in c# in the webAPI before it gets to the DB – nanonerd Oct 30 '14 at 01:10
  • Your C# code is almost certainly wrong. Start from the beginning: what are your input data, text or bytes, and if bytes, then what bytes exactly. – Karol S Oct 30 '14 at 10:56
  • 2
    You say you have "Encode in UTF-8 turned on". The problem is that those bytes are not UTF-8 codes; byte values in the range 0x80..0xBF can only appear in UTF-8 after a byte in the range 0xC2..0xF4. So, the data is not in UTF-8; it is in some other, non-Unicode code page, where 0xA0 is a non-breaking space, and 0x96 is a long dash. Ultimately, you can't simply treat arbitrary data as UTF-8 encoded data, and you need to convert from the current code set to UTF-8. What I don't know is how best to help you convert it. I don't use Windows enough to know what the tools are. – Jonathan Leffler Oct 30 '14 at 12:55

1 Answers1

2

It looks like the encoding of your original text could be Windows 1252:

96 = U+2013 : EN DASH
A0 = U+00A0 : NO-BREAK SPACE

So using System.Text.Encoding.GetEncoding("Windows-1252"), you should be able to read your original text without corrupting it, and you can then convert it to whatever encoding is being used by your database (eg: utf-8).

TachyonVortex
  • 8,242
  • 3
  • 48
  • 63
  • He drinks Blue Moon beer if his avatar is a hint, he doesn't get out much. Encoding.Default is the likely solution. – Hans Passant Oct 31 '14 at 22:14
  • Hans, you are European, if your name is a hint. Just because an American company is able to brew a Belgian style beer just as good or better than Euro beer, please don't be jealous my friend. – nanonerd Oct 31 '14 at 23:04
  • Tachy - I ended up doing a direct string Replace for U2013 and U2014 (to cover both en and em dash) and did the same for two different types of spaces. But conceptually, I think you're saying covert entire string from windows-1252 and convert it to utf-8? If so, that would be better solution to convert all other possible funky chars. I looked around but am not sure how to code this. If this is the correct thinking, could you provide a quick code sameple? Thank you. – nanonerd Oct 31 '14 at 23:09
  • @nanonerd Yes exactly, it'd be good to ensure that your solution can handle all possible characters. It looks like the [`Encoding.Convert`](http://msdn.microsoft.com/en-us/library/kdcak6ye(v=vs.110).aspx) method is probably what you want. There's a code sample [here](http://stackoverflow.com/a/1922253) which you could adapt for your needs. Also, you'll need to confirm what encoding is being used by your database (it may be `utf-8`, but it may not be.) – TachyonVortex Oct 31 '14 at 23:59