6

How can we remove symbols from a string in C#?

For example:

Input:  "�Click me."

Output: "Click me."

BenMorel
Novice
  • How did you get such symbols in the first place? Looks like you have crapped encoding. Replacing is not a solution. Tackle the problem at its roots, which is: fix the way you are getting this string and don't try to resuscitate a dead thing. – Darin Dimitrov Mar 02 '11 at 17:17
  • Looks like an [encoding/codepage issue](http://www.joelonsoftware.com/articles/Unicode.html). – Uwe Keim Mar 02 '11 at 17:18
  • Are you just trying to remove everything but alphanumeric and punctuation? – Babak Naffas Mar 02 '11 at 17:21
  • @Darin: Won't ALT + give all such symbols? :) Let's assume he has to deal with all that. – naveen Mar 02 '11 at 17:23
  • @yetanothercoder, that's something that should have been mentioned in the question. Usually when people ask a question on StackOverflow it is considered good practice to describe their scenario. Unfortunately this isn't the case with this question. – Darin Dimitrov Mar 02 '11 at 17:24
  • @Darin: always respected your wisdom. It's something called novice :) – naveen Mar 02 '11 at 17:29
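To illustrate the encoding point raised in the comments above, here is a minimal sketch of how a `�` can end up in a string. The class name `EncodingDemo` and the sample bytes are made up for illustration; the assumption is that the text was written in one encoding (ISO-8859-1 here) and read back as UTF-8.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decodes raw bytes with the given encoding (hypothetical helper).
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // "Café" encoded as ISO-8859-1: the 'é' is the single byte 0xE9.
        byte[] latin1Bytes = { 0x43, 0x61, 0x66, 0xE9 };

        // Wrong decoder: 0xE9 alone is not a valid UTF-8 sequence, so the
        // decoder substitutes U+FFFD (the replacement character).
        string wrong = Decode(latin1Bytes, Encoding.UTF8);

        // Matching decoder: the text comes back intact.
        string right = Decode(latin1Bytes, Encoding.GetEncoding("ISO-8859-1"));

        Console.WriteLine(wrong.Contains("\uFFFD")); // True
        Console.WriteLine(right == "Caf\u00E9");     // True
    }
}
```

If this is what is happening, the original character is already lost by the time the string contains U+FFFD, which is why fixing the decoding step is preferable to replacing characters afterwards.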

3 Answers

7

A simplistic solution would be to strip all non-ASCII characters from your string. There are a couple of ways to do this, described on this question; the simplest is probably:

// requires: using System.Text.RegularExpressions;
string s = "�Click me.";
s = Regex.Replace(s, @"[^\u0000-\u007F]", "");

Although as mentioned, this may be an encoding/codepage issue -- using a regex here may not necessarily be the appropriate solution.

EDIT: Based on your comments, here are a couple of other patterns you can try:

Remove all non-ASCII characters and ASCII control characters (the range ends at \u007E because \u007F, DEL, is itself a control character):

s = Regex.Replace(s, @"[^\u0020-\u007E]", "");

Remove everything except alphanumeric ASCII characters (note that this also strips spaces and punctuation):

s = Regex.Replace(s, @"[^A-Za-z0-9]", "");
Donut
  • When I try to write XML after replacing the non-ASCII characters, I still get this error: "hexadecimal value 0x05 is invalid character" – Novice Mar 02 '11 at 17:52
  • @Novice `0x05` is an ASCII control character. If you wanted to remove those as well, you could use this instead of what's posted in my answer: `s = Regex.Replace(s, @"[^\u0020-\u007F]", "");`. Where are you getting your input from? – Donut Mar 02 '11 at 17:58
  • I am getting the input from a MySQL database. The field's character set in the table is "utf_8". – Novice Mar 02 '11 at 18:05
  • @Novice Try using the second code snippet I posted in my comment, does this solve the issue? – Donut Mar 02 '11 at 18:36
  • @Donut Yes, it has partially solved the issue, but when it processes another input string it throws an error again: "0X1C is invalid character". Can we remove all characters except a-z, A-Z, and 0-9? Thanks for your reply. – Novice Mar 02 '11 at 18:52
  • @Novice Updated my answer with another pattern that you can try that should replace all non-alphanumeric characters. – Donut Mar 02 '11 at 18:56
  • @Donut I've tried, but unfortunately it's still throwing the error "0x1C is an invalid hexadecimal character" – Novice Mar 02 '11 at 19:08
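Since the errors reported in these comments come from writing XML, one alternative to a hand-written regex is to filter on what XML 1.0 itself permits. This is only a sketch: `XmlTextCleaner` and `RemoveInvalidXmlChars` are made-up names, and it assumes the cleaned string is the one that is actually passed to the XML writer. `XmlConvert.IsXmlChar` is available from .NET 4.0 onwards.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class XmlTextCleaner
{
    // Keeps only the UTF-16 code units that XML 1.0 allows and drops the rest,
    // e.g. control characters such as 0x05 and 0x1C.
    // Caveat: this checks one code unit at a time, so characters outside the
    // Basic Multilingual Plane (stored as surrogate pairs) may be dropped too;
    // XmlConvert.IsXmlSurrogatePair exists for handling those.
    public static string RemoveInvalidXmlChars(string text)
    {
        return new string(text.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
    }

    public static void Main()
    {
        string dirty = "\u0005Click\u001C me.";
        Console.WriteLine(RemoveInvalidXmlChars(dirty)); // Click me.
    }
}
```

Note that U+FFFD is a legal XML character, so this filter leaves any `�` in place; it only addresses the "invalid character" errors.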
3
var output = input.Replace("�","");

Simples!

Jamiec
  • Yeah, but `�` is what you see on the screen, the actual value is probably something else so this replace won't likely do much. – Darin Dimitrov Mar 02 '11 at 17:20
  • Oh totally, I just answered the question directly - "How can we replace symbols from string in C#" – Jamiec Mar 02 '11 at 17:21
  • @Darin Dimitrov: It surely would remove � though? It is a character *somewhere* after all – Kurru Mar 02 '11 at 17:23
  • @Kurru, of course, it's just that I doubt very much that the actual character to be replaced here is �. – Darin Dimitrov Mar 02 '11 at 17:24
  • @HansPassant: The question-mark-in-diamond symbol represents the Unicode Replacement Character, [`U+FFFD`](http://www.fileformat.info/info/unicode/char/fffd/index.htm). It's what the decoder generates when it finds a byte pattern that's invalid for the current encoding. A character that's not supported by the current font usually appears as a simple rectangle. – Alan Moore Sep 20 '14 at 15:16
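If the character really is U+FFFD, as the last comment suggests, the replace can be written with an escape sequence so that it does not depend on how the source file itself is encoded. A minimal sketch, with `ReplacementCharDemo` and `StripReplacementChar` as made-up names:

```csharp
using System;

public static class ReplacementCharDemo
{
    // U+FFFD, the Unicode Replacement Character, written as an escape so the
    // literal survives regardless of the source file's encoding.
    public const string ReplacementChar = "\uFFFD";

    // Removes every occurrence of U+FFFD (ordinal comparison).
    public static string StripReplacementChar(string input)
    {
        return input.Replace(ReplacementChar, string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(StripReplacementChar("\uFFFDClick me.")); // Click me.
    }
}
```

This removes only the marker; whatever character the decoder could not read is not recovered.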
2

You can also use Unicode block names (in .NET, `name` can be a general category such as `Cc` or a block name with an `Is` prefix):

source = Regex.Replace(source , @"\p{name}", "");

A list of names can be found in this article. I'm not sure what block your character would belong to though; if it is U+FFFD, as a comment on another answer suggests, it falls in the Specials block.
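As a concrete sketch of the `\p{name}` approach, assuming the unwanted characters are either U+FFFD or control characters: as far as I understand the .NET regex documentation, `\p{IsSpecials}` names the block U+FFF0 to U+FFFF and `\p{Cc}` is the "Control" general category. The class and method names below are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeClassDemo
{
    // Removes characters in the Specials block (U+FFF0..U+FFFF),
    // which includes U+FFFD.
    public static string RemoveSpecialsBlock(string source)
    {
        return Regex.Replace(source, @"\p{IsSpecials}", "");
    }

    // Removes characters in the general category Cc (control characters).
    // Note that this includes tab, carriage return and line feed.
    public static string RemoveControlChars(string source)
    {
        return Regex.Replace(source, @"\p{Cc}", "");
    }

    public static void Main()
    {
        Console.WriteLine(RemoveSpecialsBlock("\uFFFDClick me.")); // Click me.
        Console.WriteLine(RemoveControlChars("Click\u001C me."));  // Click me.
    }
}
```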

Abe Miessler