Say we have a string like the one below.

string s = "此检查项己被你忽略,请联系医生。\u2028内科";

How can I remove Unicode characters like \u2028 from the string?

I tried the approaches from the questions below. Unfortunately, none of them work. Please help. Thanks.

Unicode characters string

Convert a Unicode string to an escaped ASCII string

Replace unicode escape sequences in a string

Updated

Why doesn't the code below work for me?

enter image description here

Updated: I tried displaying the string in the output window. The character is a line separator.

enter image description here

Joe.wang
  • Hmm, if you only want to remove that specific text from the string, you could do `s.Replace(@"\u2028", "");` – Agent_Orange Mar 03 '18 at 10:41
  • Are you sure that what you are seeing isn't a debugger artefact? If you were to write the string to a log/console/`Debug.WriteLine`, you'll see that the debugger visualizer includes escape codes that aren't the actual value of the string. – spender Mar 03 '18 at 10:53
  • I want to remove all Unicode characters of this kind from my string, not just `\u2028`. Thanks. @Agent_Orange – Joe.wang Mar 03 '18 at 10:53
  • Really: Take a look https://pasteboard.co/HaaWlfi.png What you are seeing is a debugger artefact. – spender Mar 03 '18 at 10:56
  • @AhmedAbdelhameed No, the line below the regex is highlighted yellow. That's where the debugger is stopped. – spender Mar 03 '18 at 10:56
  • @Joe.wang Oh, I'm sorry :v – Agent_Orange Mar 03 '18 at 10:59
  • @spender I tried `Debug.WriteLine` to display it in the output. I think it is a line separator. Thanks. – Joe.wang Mar 03 '18 at 11:01
  • The fundamental premise of your question (removing unicode) is broken, because all strings are stored as unicode in memory. All the characters are unicode. – spender Mar 03 '18 at 11:01
  • Unicode character 2028 (hex) is a "line separator" – Hans Kesting Mar 03 '18 at 11:03
  • The debugger does not display the string as it "really" is in memory, but a representation that could be used directly in source code. That is why you might see escape sequences like this, or extra backslashes before quotes and such. – Hans Kesting Mar 03 '18 at 11:05
  • @spender I have to remove these Unicode characters because the next thing I want to do is string matching, for example `string.index("some words")`. I think the string without these Unicode characters differs from the original. Thanks. – Joe.wang Mar 03 '18 at 11:07
  • Something like `string.Concat(input.Where(c => !char.IsSeparator(c)))`? – spender Mar 03 '18 at 11:08
  • Still no luck. @spender – Joe.wang Mar 03 '18 at 11:12
  • Ok. (I edited the above slightly, but now it gobbles spaces too). I don't think there's a one size fits all solution here. I don't know what the separator rules are for this character set but you may find a combination of methods on `char` (such as `IsLetterOrDigit` etc) that might suit your needs in a LINQ statement. – spender Mar 03 '18 at 11:15
  • There are too many of them (`\uxxxx`) to deal with. I cannot handle them one by one. Thanks. @spender – Joe.wang Mar 03 '18 at 11:17
  • Have you tried with `System.Net.WebUtility.UrlDecode()` or `System.Net.WebUtility.HtmlDecode()`? This is their job. – Jimi Mar 03 '18 at 14:31
  • `@"[^\u0000-\uFFFF]+"` is the set of UTF-16 code units not in the range of all UTF-16 code units—in other words, an empty set. Perhaps you meant `@"[\u0000-\uFFFF]+"`. (That statement must have been an experiment because it either does nothing or replaces all non-empty strings with the empty string.) – Tom Blodget Mar 03 '18 at 16:33
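
For the case discussed in the comments above, where the string holds a real U+2028 character rather than the literal text `\u2028`, one option is to filter by Unicode category instead of listing code points one by one. This is only a minimal sketch. It assumes you want to drop line separators, paragraph separators and format characters while keeping ordinary spaces, and the names `InvisibleCharFilter` and `RemoveInvisible` are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class InvisibleCharFilter
{
    // Hypothetical helper: removes characters whose Unicode category is
    // LineSeparator, ParagraphSeparator or Format. Ordinary spaces
    // (category SpaceSeparator), letters and punctuation are kept.
    public static string RemoveInvisible(string input)
    {
        return string.Concat(input.Where(c =>
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            return category != UnicodeCategory.LineSeparator       // U+2028
                && category != UnicodeCategory.ParagraphSeparator  // U+2029
                && category != UnicodeCategory.Format;             // e.g. U+FEFF
        }));
    }
}
```

Calling `InvisibleCharFilter.RemoveInvisible(s)` on the string from the question should give the same text with the line separator gone. If tabs and newlines should go too, `UnicodeCategory.Control` could be added to the list, but whether that is appropriate depends on the text being matched.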

1 Answer

As noted by @spender in the comments above:

The fundamental premise of your question (removing unicode) is broken, because all strings are stored as unicode in memory. All the characters are unicode.

However, if your string contains the literal text "\uXXXX" (a backslash, the letter u and four hex digits, rather than the character that escape sequence stands for) and you'd like to replace or remove it, you can use a regex pattern like this: @"\\u[0-9A-Fa-f]{4}"

Here's a complete example:

using System.Diagnostics;
using System.Text.RegularExpressions;

string noUnicode = "此检查项己被你忽略,请联系医生。内科";

// If you hard-code the string, you MUST add an `@` before it. Otherwise, the
// compiler treats "\u2028" as an escape sequence and replaces it with the
// corresponding Unicode character.
string s = @"此检查项己被你忽略,请联系医生。\u2028内科";
string ss = Regex.Replace(s, @"\\u[0-9A-Fa-f]{4}", string.Empty);

Debug.Print("s = " + s);
Debug.Print("ss = " + ss);

Debug.Print((ss == noUnicode).ToString());

Here's a fiddle to test, and here's its output:

Fiddle

Note: Because the string is hard-coded here, the @ prefix is required to keep the substring "\u2028" from being converted to the corresponding Unicode character. If you get the original string from somewhere else (e.g., read from a text file) and it contains the literal text "\u2028", no such conversion happens, and the code above works as is.
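
To make the difference between the two literal forms concrete, here is a small sketch. The class and method names (`EscapeDemo`, `ContainsLiteralEscape`) are hypothetical; the pattern is the one used above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // A regular literal: the compiler turns \u2028 into ONE character.
    public const string Escaped = "\u2028";

    // A verbatim literal: SIX characters, a backslash followed by "u2028".
    public const string Literal = @"\u2028";

    // Same pattern as in the answer: a literal backslash, 'u', four hex digits.
    public const string Pattern = @"\\u[0-9A-Fa-f]{4}";

    // True only when the text contains the six-character escape text itself.
    public static bool ContainsLiteralEscape(string text)
    {
        return Regex.IsMatch(text, Pattern);
    }
}
```

Here `EscapeDemo.Escaped.Length` is 1 and `EscapeDemo.Literal.Length` is 6, and `ContainsLiteralEscape` returns true only for the verbatim form. In other words, this regex has nothing to match in a string that already holds the real line separator character.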

So, something like this would work exactly the same:

string s = File.ReadAllText(@"Path\to\a\Unicode\text\file\containing\the\string\'\u2028'");
string ss = Regex.Replace(s, @"\\u[0-9A-Fa-f]{4}", string.Empty);