49

I have a C# routine that imports data from a CSV file, matches it against a database and then rewrites it to a file. The source file seems to have a few non-ASCII characters that are fouling up the processing routine.

I already have a static method that I run each input field through but it performs basic checks like removing commas and quotes. Does anybody know how I could add functionality that removes non-ASCII characters too?

Mark Amery
user135498

8 Answers

58

Here is a simple solution:

public static bool IsASCII(this string value)
{
    // ASCII encoding replaces non-ascii with question marks, so we use UTF8 to see if multi-byte sequences are there
    return Encoding.UTF8.GetByteCount(value) == value.Length;
}

source: http://snipplr.com/view/35806/

Jaider
  • 4
    This solution has the benefit of working in portable class libraries, where Encoding.ASCII is not available. – Stephen Rudolph Jul 07 '14 at 16:09
  • 4
    It also has the benefit of being a lot faster than the accepted solution because it does not need to actually create an encoded string. – Roman Starkov Oct 13 '14 at 22:56
  • 8
    -1; the question asked for "functionality that removes non-ASCII characters", which this doesn't do. The *title* was ambiguous, but the solution to that is to clarify the title (which I've done), not to answer a question that the OP didn't ask. This might be a good answer to a different question than the one you've posted it on, but is a non-answer to the one you did. – Mark Amery May 05 '17 at 12:31
  • you are genius! – Malik Khalil Oct 09 '19 at 12:54
45
string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
Raktim Biswas
EToreo
  • 15
    Important to note that using asciiencoding will replace all non-ascii characters with '?'(63), which may or may not be what you want or expect. – captncraig Nov 12 '12 at 19:58
  • 12
    furthermore, you can check if it contains only ASCII, if `s == sOut` – Jaider Dec 12 '12 at 21:46
15

Do it all at once

public string ReturnCleanASCII(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach(char c in s)
    {
       if((int)c > 127) // you probably don't want 127 either
          continue;
       if((int)c < 32)  // I bet you don't want control characters 
          continue;
       if(c == ',')
          continue;
       if(c == '"')
          continue;
       sb.Append(c);
    }
    return sb.ToString();
}
Community
paparazzo
  • I would want tab, line feed and carriage return (9, 10, 13), so I just added `if ((int)c == 9 || (int)c == 10 || (int)c == 13)` as the first if and append it. – Skillie Oct 29 '18 at 07:17
8

If you wanted to test a specific character, you could use

if ((int)myChar <= 127)

Just getting the ASCII encoding of the string will not tell you that a specific character was non-ASCII to begin with (if you care about that). See MSDN.
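The same per-character test can also drive the removal the question asks for. Here is a minimal sketch, assuming a hypothetical helper named RemoveNonAscii (it is not part of the answer above) that keeps only characters whose code is 127 or below:

```csharp
using System;
using System.Text;

public static class AsciiFilter
{
    // Hypothetical helper: copies only chars in the 7-bit ASCII range (0..127).
    public static string RemoveNonAscii(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c <= 127)
                sb.Append(c);
        }
        return sb.ToString();
    }
}

public static class AsciiFilterDemo
{
    public static void Main()
    {
        // U+00E9 (e with acute) and U+00EF (i with diaeresis) are dropped.
        Console.WriteLine(AsciiFilter.RemoveNonAscii("caf\u00E9 na\u00EFve")); // prints "caf nave"
    }
}
```

Unlike the encoding round trip, this drops the offending characters instead of substituting '?', which may or may not be what the import needs.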

Eric J.
7

Here's an improvement upon the accepted answer:

string fallbackStr = "";

Encoding enc = Encoding.GetEncoding(Encoding.ASCII.CodePage,
  new EncoderReplacementFallback(fallbackStr),
  new DecoderReplacementFallback(fallbackStr));

string cleanStr = enc.GetString(enc.GetBytes(inputStr));

This method will replace unknown characters with the value of fallbackStr, or if fallbackStr is empty, leave them out entirely. (Note that enc can be defined outside the scope of a function.)
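To make the effect of fallbackStr concrete, here is a small sketch that wraps the snippet above in a hypothetical ToAscii method (the class name, method name, and sample input are illustrative assumptions) and runs it with an empty and a non-empty fallback:

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Round-trips the input through an ASCII encoding whose fallbacks
    // substitute `replacement` for anything ASCII cannot represent.
    public static string ToAscii(string input, string replacement)
    {
        Encoding enc = Encoding.GetEncoding(
            Encoding.ASCII.CodePage,
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback(replacement));

        return enc.GetString(enc.GetBytes(input));
    }

    public static void Main()
    {
        Console.WriteLine(ToAscii("caf\u00E9", ""));  // prints "caf"
        Console.WriteLine(ToAscii("caf\u00E9", "?")); // prints "caf?"
    }
}
```

In real code the encoding object would probably be created once and reused, as the note above suggests, instead of being rebuilt on every call.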

rookie1024
2

It seems kind of strange that it's acceptable to drop the non-ASCII characters.

Also, I always recommend the excellent FileHelpers library for parsing CSV files.

Jonas Elfström
1
strText = Regex.Replace(strText, @"[^\u0020-\u007E]", string.Empty);
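For readers unfamiliar with the pattern: the character class [^\u0020-\u007E] matches any character outside the printable ASCII range (space through tilde), so the call deletes control characters such as tab and newline as well as non-ASCII characters. A small sketch, where the class name, method name, and sample input are illustrative assumptions:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Keeps only printable ASCII (U+0020 through U+007E); everything else is removed.
    public static string KeepPrintableAscii(string input)
    {
        return Regex.Replace(input, @"[^\u0020-\u007E]", string.Empty);
    }

    public static void Main()
    {
        // The tab and U+00E9 both fall outside the allowed range.
        Console.WriteLine(KeepPrintableAscii("a\tb\u00E9c")); // prints "abc"
    }
}
```

If tabs or line breaks must survive, the class would need to be widened, for example to [^\u0009\u000A\u000D\u0020-\u007E].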
jmoerdyk
Chintoo
  • 1
    Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, **can you [edit] your answer to include an explanation of what you're doing** and why you believe it is the best approach? – Jeremy Caney Apr 14 '22 at 02:13
0
    // Replaces every non-ASCII character (code above 127) with '?'.
    public string RunCharacterCheckASCII(string s)
    {
        // The original swallowed all exceptions; the only realistic failure
        // was a null input, so handle that explicitly instead.
        if (s == null)
            return s;

        char[] chars = s.ToCharArray();
        bool found = false;
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] > 127) // outside the 7-bit ASCII range
            {
                found = true;
                chars[i] = '?';
            }
        }

        // Only allocate a new string when something was actually replaced.
        return found ? new string(chars) : s;
    }