Replacing specific Unicode characters in strings read from Excel

Question

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.

I retrieve the cell contents into a string thus:

string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();

Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):

s = Regex.Replace(s, "\u0094", "\u0022");

But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:

bool a = s.Contains("\u0094");

but it returns false. However:

bool b = s.Contains("”");

returns true.

My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.

score 4 · Accepted Answer · edited May 23 '17 at 11:56

Yes strings in .Net are UTF-16.

You're doing it right; perhaps your hex-math is incorrect. The character you tested for isn't "\u0094" (Not sure that's what you meant). The following worked for me:

((int)"”"[0]).ToString("X") returns "201D"

"”" == "\u201D" returns true

"\u0094" == "" (right hand side is the empty string) returns false

A lot of UTF-16 characters will seem as an empty string by the text visualizer but they can either be an undisplayable character or part of a surrogate (i.e. Some characters may need to be typed "\UXXXXXXXX" while others you can do with (four digits) "\uXXXX".). My knowledge of this domain is very limited.

References - Jon Skeet's articles on:

Yes, you're quite right, my hex values were way off. Turns out I was using code points for the WIN1252 character set instead of UTF-16. One day I'll understand all of this (upon which it will be replaced by something vastly more complicated in incomprehensible). Thanks for the answer. — Sid Holland, Oct 03 '11 at 17:55

score 2 · Answer 2 · answered Oct 07 '11 at 08:48

2

You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accomodate those characters. That wayyou don't have to convert the whole database, and you are future proof, because the columns will be Unicode.

answered Oct 07 '11 at 08:48

Mihai Nita

5,547
27
27

Yes, that would be ideal, but sadly I have no control over the database. – Sid Holland Oct 08 '11 at 04:14

Replacing specific Unicode characters in strings read from Excel

2 Answers2