I'm doing a bulk upload of data from a .csv file and I need to replace the non-ASCII character "�" with a normal space, " ".

The character "�" corresponds to "\uFFFD" in C, C++, and Java notation, and it appears to be called REPLACEMENT CHARACTER. There are other special characters too, such as the space-like U+FEFF, U+205F, U+200B, U+180E, and U+202F mentioned in the official C# documentation.

I'm trying to do the replacement this way:

using System.Text.RegularExpressions;

public string Errors = "";

public void test()
{
    string validCharacters = "^[0-9A-Za-z().:%-/ ]+$";
    string textFromCsvCell = "This is my text from csv file"; //All spaces aren't normal space " "
    // Replace the replacement character with a normal space
    string cleaned = textFromCsvCell.Replace("\uFFFD", " ");
    if (Regex.IsMatch(cleaned, validCharacters))
    {
        //All code for insert
    }
    else
    {
        Errors = cleaned;
        //print Errors
    }
}

The test method shows me this text:

"This is my�text from csv file"

I have also tried some other approaches:

Attempt 1: Using Trim and Regex.Replace

 Regex.Replace(value.Trim(), @"[^\S\r\n]+", " ");

Attempt 2: Using Regex.Replace

  System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ");

Attempt 3: Using Trim

  String.Trim(new char[]{'\uFEFF', '\u200B'});

Attempt 4: Adding [\S\r\n] to validCharacters

  string validCharacters = "^[\S\r\n0-9A-Za-z().:%-/ ]+$";

Nothing works.

How can I replace it?

EDITED

This is the original string:

"SYSTEM OF MONITORING CONTINUES OF GLUCOSE"

in 0x... notation

SYSTEM OF0xA0MONITORING CONTINUES OF GLUCOSE

Solution

Paste the string into the Unicode code converter, look at the conversions to find out which character it really is, and then replace that character.

In my case, I do a simple replace:

 // value contains a non-breaking space (U+00A0) between "OF" and "MONITORING"
 string value = "SYSTEM OF\u00A0MONITORING CONTINUES OF GLUCOSE";
 string pattern = @"[^\u0000-\u007F]+"; // one or more non-ASCII characters
 string replacement = " ";

 Regex rgx = new Regex(pattern);
 string cleaned = rgx.Replace(value, replacement);

 if (Regex.IsMatch(cleaned, "^[0-9A-Za-z().:<>%-/ ]+$"))
 {
    //all code for insert
 }
 else
 {
    //Error messages
 }

This expression matches the characters I want to treat as spaces: space, form feed (page break), line feed, carriage return, tab, vertical tab, and the Unicode space characters:

[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]
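A minimal sketch of how this character class could be used to turn every space-like character into a normal space. The class name `WhitespaceNormalizer` and the method `Normalize` are made up for illustration, and the range `\u2000-\u200a` is just a shorter way of writing the same code points. Note that the trailing `+` collapses a run of such characters into one space, and that line breaks are replaced too, which may or may not be what you want for a CSV cell.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceNormalizer
{
    // Same code points as the character class above, with a range for U+2000..U+200A.
    private static readonly Regex SpaceLike = new Regex(
        @"[ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]+");

    // Replaces each run of space-like characters with a single normal space.
    public static string Normalize(string input)
    {
        return SpaceLike.Replace(input, " ");
    }

    public static void Main()
    {
        // U+00A0 (no-break space) and U+2003 (em space) are both in the class.
        string value = "SYSTEM OF\u00A0MONITORING\u2003CONTINUES OF GLUCOSE";
        Console.WriteLine(Normalize(value));
        // prints: SYSTEM OF MONITORING CONTINUES OF GLUCOSE
    }
}
```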

Peter Mortensen
diegobarriosdev
  • Chances are the problem occurred before you got it as a string, as part of the process of decoding from bytes to text. You haven't shown us that though. – Jon Skeet May 16 '17 at 13:48
  • If you're just trying to clean a file, you can do that in Notepad++ if you're not trying to do it programmatically. – johnny 5 May 16 '17 at 13:48
  • The symbol is part of the `\p{S}` Unicode category class. Just try `Regex.Replace(str, @"\p{S}+", "")`. If it does not work, the string just does not contain that symbol, and the problem is out there. Note that some of your tries (`@"[^\S\r\n]+"`, `@"\s+"` (that char is not whitespace) and `"^[\S\r\n0-9A-Za-z().:%-/ ]+$"` (adding `\S` makes it match all non-whitespace chars, and you should have used a verbatim string literal here)) do not make sense. Trimming does not make sense either as the char is not in a leading/trailing position. – Wiktor Stribiżew May 16 '17 at 13:50
  • @johnny-5 I need to program it, the problem is the clients, they fill the .csv files – diegobarriosdev May 16 '17 at 13:51
  • Please paste the exact original string you have into the question body. You wrote: *ALl spaces aren't normal space " "* BUT after I copied the string I only see regular spaces (`\x20`). – Wiktor Stribiżew May 16 '17 at 13:56
  • Please add the real, exact, input string. Without it, we cannot help, only you may solve the issue. Go to http://r12a.github.io/apps/conversion/, paste the string there, check in UniView what characters your string consists of, and then remove the actual char with appropriate code. – Wiktor Stribiżew May 16 '17 at 14:01
  • @wiktor-stribiżew this is the original string: "SYSTEM OF MONITORING CONTINUES OF GLUCOSE". Clients copy and paste from internet. – diegobarriosdev May 16 '17 at 14:14
  • @wiktor-stribiżew this is on 0x notation "SYSTEM OF0xA0MONITORING CONTINUES OF GLUCOSE" and U-Hex "SYSTEM OFU+00A0MONITORING CONTINUES OF GLUCOSE" – diegobarriosdev May 16 '17 at 14:21
  • Ok, but `A0` is a non-breaking whitespace, and you may easily remove it with `Regex.Replace(s, @"\s+", "")`. Sorry, the question - regardless how many upvotes it gets - is off-topic right now. I need to go now, I will check via mobile. – Wiktor Stribiżew May 16 '17 at 14:22
  • It might be better to say, how can I strip out any characters which are not readable ASCII. To that end, I believe my answer does, just that. :-) – ΩmegaMan May 16 '17 at 14:26
  • I'm really grateful with the help. I'm just starting like programmer. Greetings from Colombia. – diegobarriosdev May 16 '17 at 14:37
  • Sorry, so your Attempt #2 is the Solution? – Wiktor Stribiżew May 16 '17 at 15:56
  • "problem is the clients, they fill the .csv files" and they do so by writing to the file with a **specific character encoding**. You need to **read it with the same one**. This is the same for all types of text files. Once you have read it into one or more .NET Strings it is Unicode and you can modify per your requirements and write it back out using a character encoding of your choosing (including ones that don't support all the characters in the text—which would result in data loss). – Tom Blodget May 16 '17 at 16:07
  • @Wiktor Stribiżew I update the solution. Thanks so very much :-) – diegobarriosdev May 16 '17 at 17:13
  • In case it wasn't clear, those bytes are a Byte Order Marker (BOM) and are part of a Unicode encoding format. Reading the string as the proper Unicode may fix this. Or, the marker may have been added multiple times as people incorrectly modified the file. Either way, beware that removing it without understanding it may cause future issues. – Michael Dorgan May 16 '17 at 17:26
  • Do you consider clean strings valid if they contain `'`, `&`, `,`, `+` or `*`? Your current regex does. – Wiktor Stribiżew May 16 '17 at 20:47
  • I'm thinking of using this [\f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000] to validate all possible spaces: space, tab, page break, line break, and carriage return – diegobarriosdev May 18 '17 at 14:51
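
The comments from Jon Skeet and Tom Blodget point at the decoding step. Here is a minimal sketch of that effect. It assumes the file was written in a single-byte encoding such as ISO-8859-1 (that is a guess based on the lone 0xA0 byte, not something the question confirms), and the class name `CsvDecodingDemo` and its method names are made up for illustration. A lone 0xA0 byte is not valid UTF-8, so decoding it as UTF-8 produces U+FFFD, while decoding it with the assumed encoding keeps the no-break space U+00A0, which can then be replaced.

```csharp
using System;
using System.Text;

public static class CsvDecodingDemo
{
    // "OF", then 0xA0 (no-break space in ISO-8859-1), then "MONITORING".
    public static readonly byte[] Raw =
    {
        0x4F, 0x46, 0xA0, 0x4D, 0x4F, 0x4E, 0x49, 0x54, 0x4F, 0x52, 0x49, 0x4E, 0x47
    };

    // Decoding as UTF-8: the invalid byte 0xA0 is turned into U+FFFD.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Decoding with the encoding the file is assumed to use keeps U+00A0.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    public static void Main()
    {
        string wrong = DecodeAsUtf8(Raw);
        string right = DecodeAsLatin1(Raw);

        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0); // prints: True
        Console.WriteLine(right.IndexOf('\u00A0') >= 0); // prints: True
        Console.WriteLine(right.Replace('\u00A0', ' ')); // prints: OF MONITORING
    }
}
```

When reading a real file, the same encoding object can be passed to `File.ReadAllText(path, encoding)` or to a `StreamReader` constructor.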

3 Answers

Using String.Replace:

Use a simple String.Replace().

I've assumed that the only characters you want to remove are the ones you've mentioned in the question, ï¿½, and that you want to replace each of them with a normal space.

string text = "imp\u00ef\u00bf\u00bdortant"; // "impï¿½ortant"
string cleaned = text.Replace('\u00ef', ' ')
        .Replace('\u00bf', ' ')
        .Replace('\u00bd', ' ');
// Returns 'imp   ortant'

Or using Regex.Replace:

string cleaned = Regex.Replace(text, "[\u00ef\u00bf\u00bd]", " ");
// Returns 'imp   ortant'

Peter Mortensen
degant
  • This doesn't work. It's actually a single character U+FFFD (decimal 65533) �. It's weird c# spits out "�" and my hex editor displayed it from the source as U+00B7. More info: https://stackoverflow.com/a/1488920 – Tim Jan 30 '20 at 20:46

Define a range of ASCII characters, and replace anything that is not within that range.


We want to find only the non-ASCII characters, so we match every character outside the ASCII range and replace it.

Regex.Replace("This is my te\uFFFDxt from csv file", @"[^\u0000-\u007F]+", " ")

The above pattern matches any run of characters that is not (^) in the set ([ ]) covering the range \u0000-\u007F, i.e. the ASCII characters (everything past \u007F is non-ASCII), and replaces each run with a single space.

Result

This is my te xt from csv file

You can adjust the range provided \u0000-\u007F as needed to expand the range of allowed characters to suit your needs.
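
As a sketch of such an adjustment, the pattern below also allows the block U+00C0-U+00FF so that accented Latin letters survive the cleaning. This extension is only an assumption about what the data might need (the block also contains the symbols U+00D7 and U+00F7), and the class name `AllowListCleaner` and the method `Clean` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AllowListCleaner
{
    // ASCII plus U+00C0..U+00FF (assumed extension for accented Latin letters).
    private static readonly Regex NotAllowed =
        new Regex(@"[^\u0000-\u007F\u00C0-\u00FF]+");

    // Replaces every run of characters outside the allowed ranges with one space.
    public static string Clean(string input)
    {
        return NotAllowed.Replace(input, " ");
    }

    public static void Main()
    {
        // U+FFFD is outside the allowed ranges, U+00F3 (o with acute) is inside.
        Console.WriteLine(Clean("Medici\u00F3n\uFFFDcontinua"));
    }
}
```

With this input, `Clean` returns the text with U+FFFD replaced by a space and the accented letter left intact.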

Peter Mortensen
ΩmegaMan

If you just want ASCII then try the following:

var ascii = new ASCIIEncoding();
byte[] encodedBytes = ascii.GetBytes(text);
var cleaned = ascii.GetString(encodedBytes).Replace("?", " ");
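
One thing to keep in mind with the snippet above is that `Replace("?", " ")` also changes question marks that were already in the text. A possible variant, sketched here under the assumption that only unencodable characters should become spaces, is to give the ASCII encoding a replacement fallback of " " via `Encoding.GetEncoding(name, encoderFallback, decoderFallback)`. The class name `AsciiFallbackDemo` and the method `ToAsciiWithSpaces` are made up for illustration.

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    public static string ToAsciiWithSpaces(string text)
    {
        // Ask the ASCII encoding to emit " " (instead of the default "?")
        // for every character it cannot represent.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(" "),
            new DecoderReplacementFallback(" "));

        byte[] encodedBytes = ascii.GetBytes(text);
        return ascii.GetString(encodedBytes);
    }

    public static void Main()
    {
        // The real question mark at the end is kept.
        Console.WriteLine(ToAsciiWithSpaces("Is this my te\uFFFDxt?"));
        // prints: Is this my te xt?
    }
}
```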
Peter Mortensen
dove