Read a file with unicode characters

Question

I have an ASP.NET C# page and am trying to read a file that contains the character ’ and convert it to ' (from a slanted apostrophe to a straight apostrophe).

FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);

//strip out bad characters
content = content.Replace("’", "'");

This doesn't work and it changes the slanted apostrophes into ? marks.

You said it changed the slanted one into "?", which means that the first argument to your Replace call is correct but the second argument is wrong. It is probably a Unicode character that *looks* like a single quote but is not actually one. In displays without a Unicode font, or when printed to the screen, an unrecognized Unicode character is displayed as "?". – Stephen Chung, Apr 27 '11 at 02:08
Check whether the second argument is the correct character. You may have accidentally turned on an Asian IME or something and typed an Asian quote character (which is Unicode) that looks exactly like a simple quote on screen. It is sometimes very hard to tell the difference. – Stephen Chung, Apr 27 '11 at 02:09
Yes, it is with the reading of the file. I used `string content = File.ReadAllText(fileinfo.FullName, Encoding.Default);` which read it in correctly. Thanks! – chris, Apr 27 '11 at 19:47

score 15 · Answer 1 · edited May 23 '17 at 12:09

I suspect that the problem is not with the replacement but with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you. However, examining content showed that the .NET Framework believed the character was Unicode character 65533, i.e. U+FFFD, the replacement character (the "WTF?" character), before the string replacement even ran. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:

content[0]; // 65533 '�'

The reason why the replace isn't working is simple - content doesn't contain the string you gave it:

content.IndexOf("’"); // -1

As for why the file reading isn't working properly: you are probably using the wrong encoding when reading the file. (If no encoding is specified, File.ReadAllText looks for a byte order mark and otherwise assumes UTF-8. There is no 100% reliable way to detect an encoding, so it can often get it wrong.) The exact encoding you need depends on the file itself. In my case the file used an 8-bit "extended ASCII" encoding, so to read it I just needed to specify that encoding:

string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));

(See this question).
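To see why the wrong encoding yields character 65533, here is a minimal sketch that decodes the single byte 0x92 two ways. The class and method names are made up for illustration, and the byte value assumes the file stores the apostrophe as 0x92, as discussed in this thread.

```csharp
using System;
using System.Text;

static class DecodeDemo
{
    // Decodes the bytes with the given encoding and returns the code of the first char.
    public static int FirstCharCode(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw)[0];
    }

    static void Main()
    {
        // Assumption: the file stores the slanted apostrophe as the single byte 0x92.
        byte[] raw = { 0x92 };

        // A lone 0x92 is not valid UTF-8, so the decoder substitutes U+FFFD.
        Console.WriteLine(FirstCharCode(raw, Encoding.UTF8)); // 65533

        // ISO-8859-1 maps each byte to the code point with the same value.
        Console.WriteLine(FirstCharCode(raw, Encoding.GetEncoding("iso-8859-1"))); // 146
    }
}
```

Once the byte has been turned into U+FFFD the original value is gone, so the fix has to happen at read time rather than in the Replace call.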

You also need to make sure that you specify the correct character in your replacement string - when using "odd" characters in code you may find it more reliable to specify the character by its character code, rather than as a string literal (which may cause problems if the encoding of the source file changes), for example the following worked for me:

content = content.Replace("\u0092", "'");
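Which character ends up in the string depends on the encoding used to read the file: reading as ISO-8859-1 gives U+0092, while reading as Windows-1252 gives U+2019. A rough sketch that handles both cases with escape sequences follows; the helper name Straighten is hypothetical.

```csharp
using System;

static class QuoteFixer
{
    // Replaces both forms the slanted apostrophe can take after decoding:
    // U+0092 (byte 0x92 read as ISO-8859-1) and U+2019 (read as Windows-1252).
    public static string Straighten(string text)
    {
        return text.Replace('\u0092', '\'').Replace('\u2019', '\'');
    }

    static void Main()
    {
        // \u escapes are hexadecimal: 0x92 is 146 in decimal.
        Console.WriteLine('\u0092' == (char)146); // True

        Console.WriteLine(Straighten("it\u0092s and it\u2019s")); // it's and it's
    }
}
```

Using escapes keeps the source file pure ASCII, so the literal cannot be damaged if the source file itself is saved in a different encoding.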

Rather than `(char)146`, `'\u0092'` might be more readable, since it matches the character code charts. – Jeffrey L Whitledge, Apr 27 '11 at 04:16
The reason `'\u0092' == (char)146` is that the `\u` notation uses hexadecimal, and `0x92 == 146`. – Justin, Apr 27 '11 at 04:27

James Lawruk · Answer 2 · 2011-07-11T17:10:01.363

3

My bet is the file is encoded in Windows-1252. This is almost the same as ISO 8859-1. The difference is that Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range", which is where the slanted apostrophe is located (0x92).

//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");
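As a small check of the claim above, this sketch decodes the bytes for "it’s" as Windows-1252 and prints the resulting code point. It assumes .NET Core or .NET 5+, where code page 1252 is, as far as I know, only available after registering CodePagesEncodingProvider; on the classic .NET Framework that line should not be needed. The class name is made up.

```csharp
using System;
using System.Text;

static class Cp1252Demo
{
    public static string Decode(byte[] raw)
    {
        // Assumption: .NET Core / .NET 5+. Without this registration,
        // Encoding.GetEncoding(1252) is expected to throw there.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(raw);
    }

    static void Main()
    {
        byte[] raw = { 0x69, 0x74, 0x92, 0x73 }; // "it’s" as Windows-1252 bytes
        string text = Decode(raw);

        Console.WriteLine((int)text[2]);                // 8217, i.e. U+2019
        Console.WriteLine(text.Replace("\u2019", "'")); // it's
    }
}
```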

edited Jul 11 '11 at 17:10

answered May 26 '11 at 17:07

Encoding.GetEncoding("Windows-1252") – Daniel Aug 25 '16 at 07:01

Trey Carroll · Answer 3 · 2011-04-27T01:38:26.750

2

// This should replace smart single quotes with a straight single quote
content = Regex.Replace(content, @"(\u2018|\u2019)", "'");

// However, the better approach seems to be to read the file with the proper encoding and leave the quotes alone.
// Note: use OpenRead() here; FileInfo.Create() would create or overwrite the file.
var sreader = new StreamReader(fileinfo.OpenRead(), Encoding.GetEncoding(1252));
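For completeness, a self-contained sketch of the StreamReader route. It opens the file with OpenRead, since FileInfo.Create would create or overwrite the file instead of reading it. The names are hypothetical, the temp file stands in for the real one, and the provider registration assumes .NET Core or .NET 5+.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class StreamReaderDemo
{
    public static string ReadAndStraighten(string path)
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var info = new FileInfo(path);
        // OpenRead opens the existing file read-only; nothing is truncated.
        using (var reader = new StreamReader(info.OpenRead(), Encoding.GetEncoding(1252)))
        {
            string content = reader.ReadToEnd();
            // Replace left and right smart single quotes with a straight quote.
            return Regex.Replace(content, @"[\u2018\u2019]", "'");
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName(); // stand-in for the real file
        try
        {
            // 0x91 and 0x92 are the smart single quotes in Windows-1252.
            File.WriteAllBytes(path, new byte[] { 0x91, 0x68, 0x69, 0x92 });
            Console.WriteLine(ReadAndStraighten(path)); // 'hi'
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```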

edited Apr 27 '11 at 01:38

answered Apr 27 '11 at 00:55

score 0 · Answer 4 · answered Apr 27 '11 at 01:56

If you use String (capitalized) and not string, it should be able to handle any Unicode you throw at it. Try that first and see if that works.

kappasims

one is an alias for the other, this doesn't change anything. – BrokenGlass Apr 27 '11 at 01:59
You're right! Then I'd assume the quotation marks in question aren't 2018/9 and maybe are dependent on the locale. Cast to an int or short to get the Unicode value and replace \u+thatNumber with what was posted earlier. – kappasims Apr 27 '11 at 02:06
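The "cast to an int" suggestion can be sketched as a small helper that lists every non-ASCII character in a string as a U+XXXX value. The helper name is made up, and it walks UTF-16 code units, which is enough for the quote characters discussed here.

```csharp
using System;
using System.Collections.Generic;

static class CodePointDump
{
    // Returns the U+XXXX form of every char above the ASCII range.
    public static List<string> NonAscii(string text)
    {
        var found = new List<string>();
        foreach (char c in text)
        {
            if (c > 127)
            {
                found.Add("U+" + ((int)c).ToString("X4"));
            }
        }
        return found;
    }

    static void Main()
    {
        // One real right single quote and one replacement character.
        Console.WriteLine(string.Join(", ", NonAscii("it\u2019s \uFFFD"))); // U+2019, U+FFFD
    }
}
```

If the output shows U+FFFD, the character was already lost while decoding, and no replacement string will match it.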
