How to remove the unicode character in the string

Question

Say we have a string like the one below.

string s = "此检查项己被你忽略，请联系医生。\u2028内科";

How can I remove Unicode characters such as \u2028 from the string?

I tried the approaches from the questions below. Unfortunately, none of them worked. Thanks.

Unicode characters string

Convert a Unicode string to an escaped ASCII string

Replace unicode escape sequences in a string

Updated

Why doesn't the code below work for me?

Updated: I wrote the string to the output window. The character is a line separator.

Hmm, if you only want to remove that specific text from the string, you could do `s = s.Replace(@"\u2028", "");` — Agent_Orange, Mar 03 '18 at 10:41
Are you sure that what you are seeing isn't a debugger artefact? If you were to write the string to a log/console/`Debug.WriteLine`, you'll see that the debugger visualizer includes escape codes that aren't the actual value of the string. — spender, Mar 03 '18 at 10:53
I want to remove all Unicode characters of this kind from my string, not just `\u2028`. Thanks. @Agent_Orange — Joe.wang, Mar 03 '18 at 10:53
Really: Take a look https://pasteboard.co/HaaWlfi.png What you are seeing is a debugger artefact. — spender, Mar 03 '18 at 10:56
@AhmedAbdelhameed No, the line below the regex is highlighted yellow. That's where the debugger is stopped. — spender, Mar 03 '18 at 10:56
@spender I tried `Debug.WriteLine` to display it in the output. I think it is a line separator. Thanks. — Joe.wang, Mar 03 '18 at 11:01
The fundamental premise of your question (removing unicode) is broken, because all strings are stored as unicode in memory. All the characters are unicode. — spender, Mar 03 '18 at 11:01
The debugger does not display the string as it "really" is in memory, but a representation that could be used directly in source code. That is why you might see escape sequences like this, or extra backslashes before quotes and such. — Hans Keﬆing, Mar 03 '18 at 11:05
@spender I have to remove these Unicode characters, because the next thing I want to do is string matching, for example `s.IndexOf("some words")`. I think a string without these characters will match differently from the original. Thanks. — Joe.wang, Mar 03 '18 at 11:07
Something like `string.Concat(input.Where(c => !char.IsSeparator(c)))`? — spender, Mar 03 '18 at 11:08
Ok. (I edited the above slightly, but now it gobbles spaces too). I don't think there's a one size fits all solution here. I don't know what the separator rules are for this character set but you may find a combination of methods on `char` (such as `IsLetterOrDigit` etc) that might suit your needs in a LINQ statement. — spender, Mar 03 '18 at 11:15
There are too many of them (`\uxxxx`) to deal with. I cannot handle them one by one. Thanks. @spender — Joe.wang, Mar 03 '18 at 11:17
Have you tried with `System.Net.WebUtility.UrlDecode()` or `System.Net.WebUtility.HtmlDecode()`? This is their job. — Jimi, Mar 03 '18 at 14:31
`@"[^\u0000-\uFFFF]+"` is the set of UTF-16 code units not in the range of all UTF-16 code units—in other words, an empty set. Perhaps you meant `@"[\u0000-\uFFFF]+"`. (That statement must have been an experiment because it either does nothing or replaces all non-empty strings with the empty string.) — Tom Blodget, Mar 03 '18 at 16:33
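
The filtering idea from the comments above can be sketched as follows. This is only an illustration, assuming the goal is to drop characters that are normally invisible (line and paragraph separators, control characters, format characters) while keeping ordinary spaces. The class and method names are hypothetical, and which categories to drop is an assumption that depends on the data.

```csharp
using System.Globalization;
using System.Linq;

public static class InvisibleCharFilter
{
    // Hypothetical helper, not part of .NET. Keeps letters, digits, punctuation
    // and ordinary spaces; drops characters in the categories listed below.
    // Assumption: tabs and newlines (category Control) should be dropped too.
    public static string RemoveInvisible(string input)
    {
        return string.Concat(input.Where(c =>
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            return category != UnicodeCategory.LineSeparator       // e.g. U+2028
                && category != UnicodeCategory.ParagraphSeparator  // e.g. U+2029
                && category != UnicodeCategory.Control
                && category != UnicodeCategory.Format;
        }));
    }
}
```

Unlike `char.IsSeparator`, this check leaves the space character alone, because a space belongs to the `SpaceSeparator` category. A call such as `InvisibleCharFilter.RemoveInvisible(s).IndexOf("内科")` would then search the cleaned string.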

41686d6564 stands w. Palestine · Answer 1 · 2018-03-03T12:07:26.957

2

As noted by @spender in the comments above:

The fundamental premise of your question (removing unicode) is broken, because all strings are stored as unicode in memory. All the characters are unicode.

However, if your string contains the literal text "\uXXXX" (a backslash, the letter u, and four hex digits, rather than the character that sequence stands for) and you'd like to replace or remove it, you can use a regex pattern like this: @"\\u[0-9A-Fa-f]{4}"

Here's a complete example:

using System.Diagnostics;
using System.Text.RegularExpressions;

string noUnicode = "此检查项己被你忽略，请联系医生。内科";

// If you hard-code the string, you MUST put an `@` before it (a verbatim literal). Otherwise
// the compiler converts the "\u2028" escape sequence into the corresponding Unicode character.
string s = @"此检查项己被你忽略，请联系医生。\u2028内科";
string ss = Regex.Replace(s, @"\\u[0-9A-Fa-f]{4}", string.Empty);

Debug.Print("s = " + s);
Debug.Print("ss = " + ss);

Debug.Print((ss == noUnicode).ToString());

Here's a fiddle to test. Its output shows s with the "\u2028" text still in place, ss without it, and True for the final comparison.

Note: Because the string is hard-coded here, the @ is needed to keep the compiler from converting the substring "\u2028" into the corresponding Unicode character. If the string comes from somewhere else (e.g., it is read from a text file) and already contains the six characters \u2028 as text, no @ is involved and the code above works unchanged.

So, something like this would work exactly the same:

string s = File.ReadAllText(@"Path\to\a\Unicode\text\file\containing\the\string\'\u2028'");
string ss = Regex.Replace(s, @"\\u[0-9A-Fa-f]{4}", string.Empty);
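
To make the distinction between the two cases concrete, here is a small sketch. It assumes nothing beyond the pattern shown above; the class, method, and constant names are hypothetical and chosen only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class EscapeTextDemo
{
    // Same pattern as above: a backslash, 'u', then four hex digits.
    private static readonly Regex EscapeText = new Regex(@"\\u[0-9A-Fa-f]{4}");

    // Hypothetical helper: removes literal "\uXXXX" text only.
    public static string StripEscapeText(string input)
    {
        return EscapeText.Replace(input, string.Empty);
    }

    // Regular literal: the compiler turns \u2028 into a single character,
    // so this string has 3 characters.
    public const string ActualChar = "a\u2028b";

    // Verbatim literal: the backslash, 'u', '2', '0', '2', '8' stay as text,
    // so this string has 8 characters.
    public const string LiteralText = @"a\u2028b";
}
```

`EscapeTextDemo.StripEscapeText(EscapeTextDemo.LiteralText)` yields "ab", while the same call on `ActualChar` returns the string unchanged, because it contains no backslash for the pattern to match. Removing the real character takes something like `EscapeTextDemo.ActualChar.Replace("\u2028", "")` instead.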

edited Mar 03 '18 at 12:07

answered Mar 03 '18 at 11:17


No, adding `@` before the string makes it a different string: the \ is then kept as a literal backslash (displayed as \\). – Joe.wang Mar 03 '18 at 11:20
It doesn't work, I checked it: `此检查项己被你忽略，请联系医生。 内科 ` still shows that Unicode. – FaizanHussainRabbani Mar 03 '18 at 11:21
@Joe.wang EXACTLY, otherwise it will have the actual Unicode char instead of "\u2028". – 41686d6564 stands w. Palestine Mar 03 '18 at 11:22
@FaizanRabbani What exactly did you try? I just updated the fiddle, please check it and let me know what doesn't work. – 41686d6564 stands w. Palestine Mar 03 '18 at 11:28
@AhmedAbdelhameed if you view the string in `HTML Visualizer`, you can still see that Unicode character – FaizanHussainRabbani Mar 03 '18 at 11:30
@FaizanRabbani Did you check the fiddle above? Did you check the output for the line that says `Console.WriteLine(ss == noUnicode);`? – 41686d6564 stands w. Palestine Mar 03 '18 at 11:38
You are adding `@` sign. That's the difference. – FaizanHussainRabbani Mar 03 '18 at 11:42
@FaizanRabbani Exactly, you either have to add an `@` *when hard-coding the string* or get the string from somewhere else (e.g., text file). FYI, [here's the VS version](https://s13.postimg.org/67h2rjebr/image.png) – 41686d6564 stands w. Palestine Mar 03 '18 at 11:45
@AhmedAbdelhameed Yes I tried with `@` character and it works – FaizanHussainRabbani Mar 03 '18 at 11:46
@AhmedAbdelhameed Sorry, I should have said "invisible Unicode char". – Joe.wang Mar 03 '18 at 12:01
@Joe.wang What exactly does that mean? Do you mean the string `"\u2028"` is already escaped and is represented as the corresponding Unicode char? If so, then it's no different from any other char, how do you think you can identify it? – 41686d6564 stands w. Palestine Mar 03 '18 at 12:05
I edited the answer several times in the last couple minutes to make things more clear. Perhaps you haven't checked the updated version. Please do. – 41686d6564 stands w. Palestine Mar 03 '18 at 12:06
We know all the chars in the string are Unicode. I think the debugger shows the code point because `\u2028` cannot be displayed like other chars. @AhmedAbdelhameed – Joe.wang Mar 03 '18 at 12:16
So, more precisely, they are invisible Unicode characters. – Joe.wang Mar 03 '18 at 12:17
@AhmedAbdelhameed Thanks man, but I don't think your answer fits my question. I think @spender suggested something useful in the comments: `string.Concat(input.Where(c => !char.IsSeparator(c)))`. It looks like the right direction so far. I don't know if there is a better solution. Let's wait and see. Thanks. – Joe.wang Mar 03 '18 at 12:30
The question and answer I've been looking for. – captain_majid Oct 01 '22 at 22:02
