How can you strip non-ASCII characters from a string? (in C#)
17 Answers
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
The ^ is the negation operator: it tells the regex to match everything that is not in the character class, instead of everything that is. The \u####-\u#### syntax specifies a range of characters. \u0000-\u007F covers the first 128 Unicode code points, which are exactly the ASCII characters. So the pattern matches every non-ASCII character (because of the negation) and replaces each match with an empty string.
(as explained in a comment by Gordon Tucker Dec 11, 2009 at 21:11)
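For completeness, here is a compilable sketch of the same call, together with the printable-only variant mentioned in the comments below (variable names are mine):
using System.Text.RegularExpressions;

string s = "søme string";

// Strip everything outside the ASCII range (U+0000–U+007F).
string asciiOnly = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

// Variant: keep only printable ASCII (U+0020–U+007E), dropping control characters as well.
string printableOnly = Regex.Replace(s, @"[^\u0020-\u007E]+", string.Empty);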

- Range for printable characters is 0020-007E, for people looking for a regular expression to replace non-printable characters – Mubashar Feb 17 '14 at 04:40
- If you wish to see a table of the ASCII character set: http://www.asciitable.com/ – scradam Feb 26 '15 at 15:06
- Range for **extended ASCII** is \u0000-\u00FF, for people looking for a regular expression to replace non-extended-ASCII characters (i.e. for apps with Spanish language, diacritics etc...) – full_prog_full Dec 29 '15 at 21:30
- @GordonTucker \u0000-\u007F is the equivalent of the **first 127 characters** in utf-8 or unicode and NOT the first 255. See [table](http://www.ascii-code.com/) – full_prog_full Dec 29 '15 at 21:33
- @full_prog_full Which is why I replied to myself about a minute later correcting myself to say it was 127 and not 255. :) – Gordon Tucker Dec 30 '15 at 21:46
- But 0000-0010 also contains non-ASCII characters like NUL, SOH, STX etc – Haseeb Mir May 03 '20 at 02:50
Here is a pure .NET solution that doesn't use regular expressions:
string inputString = "Räksmörgås";
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(inputString)
    )
);
It may look cumbersome, but it should be intuitive. It uses the .NET ASCII encoding to convert the string. UTF-8 is used as the intermediate encoding because it can represent any of the original characters, and an EncoderReplacementFallback converts every non-ASCII character to an empty string.
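If this runs in a hot path, the encoding and its fallbacks can be cached instead of being rebuilt on every call, and since GetBytes already applies the fallback, the round-trip through Encoding.Convert can be skipped (see also the performance-focused answer further down). A minimal sketch; the class and method names are mine:
using System.Text;

public static class AsciiStripper
{
    // Cache the "replace non-ASCII with nothing" encoding so the fallback
    // objects are created only once (names are illustrative, not from the answer).
    private static readonly Encoding AsciiDropping = Encoding.GetEncoding(
        Encoding.ASCII.EncodingName,
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    public static string ToAsciiOnly(string input)
    {
        // GetBytes drops every character the encoder cannot represent;
        // GetString turns the remaining ASCII bytes back into a string.
        return AsciiDropping.GetString(AsciiDropping.GetBytes(input));
    }
}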
- Perfect! I'm using this to clean a string before saving it to a RTF document. Very much appreciated. Much easier to understand than the Regex version. – Nathan Prather Oct 06 '09 at 16:48
- You really find it easier to understand? To me, all the stuff that's not really relevant (fallbacks, conversions to bytes etc) is drawing the attention away from what actually happens. – bzlm Oct 11 '09 at 15:28
- @Brandon, actually, this technique doesn't do the job better than other techniques. So the analogy would be using a plain olde screwdriver instead of a fancy iScrewDriver Deluxe 2000. :) – bzlm Aug 04 '11 at 07:46
- @bzlm It's like using a hammer on a screw :) OK not. So it's like using the crankshaft of your car engine to drive a screw. There we go. – Brandon Aug 22 '11 at 17:10
- @InsidiousForce, probably depends on which regular expression you use. Why don't you take one of the expressions from one of the answers to this question and benchmark it? :) – bzlm May 27 '13 at 08:36
- One advantage is that I can easily replace ASCII with ISO 8859-1 or another encoding :) – Akira Yamamoto Jul 04 '13 at 03:34
- We have a FoxPro DB that our system uses, which gets corrupted as a pastime. Since this function is run on almost every field of every row, I was curious to know the performance difference and whether there was anything better than plain regex. For 1,000 randomly generated Unicode strings the run times are `Regexp: Avg: 3~4ms, Max: 4ms` and `Encoding Conversion: Avg: 4~5ms, Max: 7ms` (not including string generation, which is outside the timer) – syserr0r Jul 16 '13 at 11:46
- @syserr0r Interesting. This technique could probably be optimized, depending on what's taking time. The 2 Fallback instances could be re-used, for example. – bzlm Aug 05 '13 at 12:20
- I'm finding this to be faster than the regex on smaller strings (they are nearly even on a 1000 character string) and slower on larger strings – msmucker0527 Oct 02 '14 at 15:17
- Wondering if I could use this somehow to replace non-ASCII characters with a replacement character, for example: `á` would be replaced with `a`. Is this possible? – CularBytes Dec 28 '15 at 12:42
- @RageCompex The EncoderReplacementFallback wasn't designed for conversion. But what you want can be achieved using the .NET APIs for Unicode Normalization and Canonicalization. – bzlm Dec 30 '15 at 11:19
I believe MonsCamus meant:
parsememo = Regex.Replace(parsememo, @"[^\u0020-\u007E]", string.Empty);
- IMHO This answer is better than the accepted answer because it strips out control characters. – Dean2690 Sep 25 '17 at 14:30
If you don't want to strip, but to actually convert Latin accented characters to non-accented ones, take a look at this question: How do I translate 8bit characters into 7bit characters? (i.e. Ü to U)
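A minimal sketch of the decompose-and-strip approach covered there (the method name is mine; the same idea appears in the Latinize function near the end of this page):
// Normalize to FormD, drop the combining (non-spacing) marks, then recompose.
static string RemoveDiacritics(string text)
{
    var sb = new System.Text.StringBuilder();
    foreach (char c in text.Normalize(System.Text.NormalizationForm.FormD))
    {
        var category = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c);
        if (category != System.Globalization.UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    }
    return sb.ToString().Normalize(System.Text.NormalizationForm.FormC);
}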
- I didn't even realize this was possible, but it's a much better solution for me. I'm going to add this link to a comment on the question to make it easier for other people to find. Thanks! – Bobson Dec 10 '13 at 15:36
Inspired by philcruz's Regular Expression solution, I've made a pure LINQ solution:
public static string PureAscii(this string source, char nil = ' ')
{
    var min = '\u0000';
    var max = '\u007F';
    return source.Select(c => c < min ? nil : c > max ? nil : c).ToText();
}

public static string ToText(this IEnumerable<char> source)
{
    var buffer = new StringBuilder();
    foreach (var c in source)
        buffer.Append(c);
    return buffer.ToString();
}
This is untested code.
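A usage sketch, assuming the extension methods above are in scope; note that with the default nil, non-ASCII characters become spaces rather than disappearing:
var clean = "søme string".PureAscii(); // "s me string" – the ø is replaced by a space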

- Instead of the separate ToText() method, how about replacing line 3 of PureAscii() with: return new string(source.Select(c => c < min ? nil : c > max ? nil : c).ToArray()); – agentnega Nov 10 '11 at 05:51
- Or perhaps ToText as: return (new string(source)).ToArray() - depending on what performs best. It's still nice to have ToText as an extension method - fluent/pipeline style. :-) – Bent Rasmussen Jan 15 '16 at 10:14
- That code replaces non-ASCII characters with a space. To strip them out, change Select to Where: `return new string( source.Where( c => c >= min && c <= max ).ToArray() );` – Foozinator May 17 '17 at 20:53
- @Foozinator That code allows you to specify which character to replace the non-ASCII characters with. By default it uses a space, but if it's called like .PureASCII(Char.MinValue), it will replace all non-ASCII with '\0' - which still isn't exactly stripping them, but similar results. – Ulfius Nov 29 '17 at 16:42
- The ToText method can be removed, and line 5 can be replaced by: `return source.Where(c => c >= min && c <= max).Aggregate(new StringBuilder(), (sb, s) => sb.Append(s), sb => sb.ToString());` – Joakim M. H. Aug 13 '19 at 06:33
I found the following slightly altered range useful for parsing comment blocks out of a database. It means you won't have to contend with tab and escape characters, which would cause a CSV field to become upset.
parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]", string.Empty);
If you want to avoid other special characters or particular punctuation, check the ASCII table.

- In case anyone hasn't noticed the other comments, the printable characters are actually @"[^\u0020-\u007E]". Here's a link to see the table if you're curious: http://www.asciitable.com/ – scradam Feb 26 '15 at 15:03
No need for regex, just use encoding...
sOutput = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(sInput));
- This does not work. This does not strip unicode characters, it replaces them with the ? character. – David Feb 27 '14 at 16:56
- @David is right. At least I got `????nacho??` when I tried: `たまねこnachoなち` in mono 3.4 – nacho4d Aug 06 '14 at 02:38
- You can instantiate your own Encoding class that removes characters instead of replacing them. See the GetEncoding method: https://msdn.microsoft.com/en-us/library/89856k4b(v=vs.110).aspx – kkara Apr 01 '16 at 13:52
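A sketch of what that last comment describes, reusing the variable names from the answer: give the ASCII encoding an encoder fallback that replaces unencodable characters with nothing, so they are dropped instead of becoming ?.
// Same one-liner idea, but the custom fallback removes non-ASCII characters
// instead of replacing them with '?'.
var asciiDrop = System.Text.Encoding.GetEncoding("us-ascii",
    new System.Text.EncoderReplacementFallback(string.Empty),
    new System.Text.DecoderReplacementFallback(string.Empty));
sOutput = asciiDrop.GetString(asciiDrop.GetBytes(sInput));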
I came here looking for a solution for extended ASCII characters, but couldn't find one. The closest I found is bzlm's solution, but that only works for ASCII codes up to 127 (obviously you can replace the encoding type in his code, but I think it was a bit complex to understand, hence sharing this version). Here's a solution that works for extended ASCII codes, i.e. up to 255, which is ISO 8859-1.
It finds and strips out characters beyond extended ASCII (code points greater than 255):
Dim str1 As String = "â, ??î or ôu� n☁i✑++$-♓!‼⁉4⃣od;/⏬'®;☕:☝)///1!@#"

Dim extendedAscii As Encoding = Encoding.GetEncoding("ISO-8859-1",
    New EncoderReplacementFallback(String.Empty),
    New DecoderReplacementFallback())

Dim extendedAsciiBytes() As Byte = extendedAscii.GetBytes(str1)
Dim str2 As String = extendedAscii.GetString(extendedAsciiBytes)
Console.WriteLine(str2)
'Output : â, ??î or ôu ni++$-!‼⁉4od;/';:)///1!@#$%^yz:
Here's a working fiddle for the code
Replace the encoding as per the requirement, rest should remain the same.

- The only one that worked to remove ONLY the Ω from this string "Ω c ç ã". Thank you very much! – Rafael Araújo May 08 '19 at 00:19
This is not optimal performance-wise, but a pretty straightforward LINQ approach:
string strippedString = new string(
    yourString.Where(c => c <= sbyte.MaxValue).ToArray()
);
The downside is that all the "surviving" characters are first put into an array of type char[], which is then thrown away once the string constructor no longer uses it.
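If the throwaway array bothers you, string.Concat has an overload that accepts the character sequence directly, so the intermediate array never needs a name (the characters are still buffered internally):
// Same filter, without naming the temporary array.
string strippedString = string.Concat(yourString.Where(c => c <= sbyte.MaxValue));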

I used this regular expression:
string s = "søme string";
Regex regex = new Regex(@"[^a-zA-Z0-9\s]", RegexOptions.None);
return regex.Replace(s, "");

- This removes punctuation as well, just in case that's not what someone wants. – Drew Noakes Jul 18 '12 at 08:43
I use this regular expression to filter out bad characters in a filename.
Regex.Replace(directory, @"[^a-zA-Z0-9\:_\- ]", "")
That should be all the characters allowed for filenames.

- 1,315
- 15
- 15
- Nope. See [Path.GetInvalidPathChars](https://msdn.microsoft.com/en-us/library/system.io.path.getinvalidpathchars(v=vs.110).aspx) and [Path.GetInvalidFileNameChars](https://msdn.microsoft.com/en-us/library/system.io.path.getinvalidfilenamechars(v=vs.110).aspx). So, there are tens of thousands of valid characters. – Tom Blodget Jun 10 '17 at 00:04
- You are correct, Tom. I was actually thinking of the common ones, but I left out parenthesis and curly braces as well as all these - ^%$#@!&+=. – user890332 Jun 12 '17 at 20:02
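Following Tom Blodget's comment, if the goal is a legal file name rather than ASCII-only text, a sketch that filters by the characters the runtime itself reports as invalid (fileName is a placeholder; requires using System.Linq):
// Drop only the characters the OS considers invalid in file names,
// instead of whitelisting ASCII letters and digits.
char[] invalid = System.IO.Path.GetInvalidFileNameChars();
string safeName = new string(fileName.Where(c => !invalid.Contains(c)).ToArray());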
public string ReturnCleanASCII(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (char c in s)
    {
        if ((int)c > 127) // you probably don't want 127 either
            continue;
        if ((int)c < 32) // I bet you don't want control characters
            continue;
        if (c == '%')
            continue;
        if (c == '?')
            continue;
        sb.Append(c);
    }
    return sb.ToString();
}

If you want a string containing only ISO-8859-1 characters, excluding the characters that are not standard, you should use this expression:
var result = Regex.Replace(value, @"[^\u0020-\u007E\u00A0-\u00FF]+", string.Empty);
Note: using the Encoding.GetEncoding("ISO-8859-1") method will not do the job, because the undefined characters are not excluded.
See the Wikipedia ISO-8859-1 code page for more details.

I did a bit of testing, and @bzlm's answer is the fastest valid answer.
But it turns out we can do much better.
The conversion using the encoding is equivalent to the following code when inlining Encoding.Convert:
public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    Encoding srcEncoding = Encoding.UTF8;
    return dstEncoding.GetString(dstEncoding.GetBytes(srcEncoding.GetChars(srcEncoding.GetBytes(unicode))));
}
As you can clearly see, we perform two redundant actions by re-encoding through UTF-8. Why is that, you may ask? C# stores strings exclusively as UTF-16 code units, which can of course also be represented in UTF-8, since the Unicode encodings are intercompatible. (Side note: @bzlm's solution transcodes through UTF-8, so malformed UTF-16 input may throw an exception during transcoding.) So the operation is independent of the source encoding, since the source is always UTF-16.
Let's get rid of the redundant re-encoding and prevent those edge-case failures.
public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    return dstEncoding.GetString(dstEncoding.GetBytes(unicode));
}
We already have a simplified and perfectly workable solution, which requires less than half as much time to compute.
There is not much more performance to be gained, but for further memory optimization we can do two things:
- Accept a ReadOnlySpan<char> for a more usable API.
- Attempt to fit the temporary byte[] onto the stack; otherwise use an array pool.
public static string StripUnicode(ReadOnlySpan<char> unicode) {
    return EnsureEncoding(unicode, GreedyAscii);
}

/// <summary>Produces a string which is compatible with the limiting encoding</summary>
/// <remarks>Ensure that the encoding does not throw on illegal characters</remarks>
public static string EnsureEncoding(ReadOnlySpan<char> unicode, Encoding limitEncoding) {
    int asciiBytesLength = limitEncoding.GetMaxByteCount(unicode.Length);
    byte[]? asciiBytes = asciiBytesLength <= 2048 ? null : ArrayPool<byte>.Shared.Rent(asciiBytesLength);
    Span<byte> asciiSpan = asciiBytes ?? stackalloc byte[asciiBytesLength];

    asciiBytesLength = limitEncoding.GetBytes(unicode, asciiSpan);
    asciiSpan = asciiSpan.Slice(0, asciiBytesLength);

    string asciiChars = limitEncoding.GetString(asciiSpan);

    if (asciiBytes is { }) {
        ArrayPool<byte>.Shared.Return(asciiBytes);
    }

    return asciiChars;
}

private static Encoding GreedyAscii { get; } = Encoding.GetEncoding(Encoding.ASCII.EncodingName, new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());
You can see this snippet in action on sharplab.io

Just decode the Unicode using Regex.Unescape(s)

- As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 17 '23 at 17:33
You can use Char.IsAscii to identify the characters you want to keep. A simple implementation might look like:
public static string StripNonAscii(this string input)
{
    StringBuilder resultBuilder = new();
    foreach (char character in input)
        if (char.IsAscii(character))
            resultBuilder.Append(character);
    return resultBuilder.ToString();
}
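For one-off use, the same check also fits in a single expression (char.IsAscii is available from .NET 6 onwards):
string stripped = string.Concat(input.Where(char.IsAscii));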

Necromancing.
Also, the method by bzlm can be used to remove characters that are not in an arbitrary charset, not just ASCII:
// https://en.wikipedia.org/wiki/Code_page#EBCDIC-based_code_pages
// https://en.wikipedia.org/wiki/Windows_code_page#East_Asian_multi-byte_code_pages
// https://en.wikipedia.org/wiki/Chinese_character_encoding
System.Text.Encoding encRemoveAllBut = System.Text.Encoding.ASCII;
encRemoveAllBut = System.Text.Encoding.GetEncoding(System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage); // System-encoding
encRemoveAllBut = System.Text.Encoding.GetEncoding(1252); // Western European (iso-8859-1)
encRemoveAllBut = System.Text.Encoding.GetEncoding(1251); // Windows-1251/KOI8-R
encRemoveAllBut = System.Text.Encoding.GetEncoding("ISO-8859-5"); // used by less than 0.1% of websites
encRemoveAllBut = System.Text.Encoding.GetEncoding(37); // IBM EBCDIC US-Canada
encRemoveAllBut = System.Text.Encoding.GetEncoding(500); // IBM EBCDIC Latin 1
encRemoveAllBut = System.Text.Encoding.GetEncoding(936); // Chinese Simplified
encRemoveAllBut = System.Text.Encoding.GetEncoding(950); // Chinese Traditional
encRemoveAllBut = System.Text.Encoding.ASCII; // putting ASCII again, as to answer the question
// https://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c
string inputString = "RäksmörПривет, мирgås";
string asAscii = encRemoveAllBut.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.UTF8,
        System.Text.Encoding.GetEncoding(
            encRemoveAllBut.CodePage,
            new System.Text.EncoderReplacementFallback(string.Empty),
            new System.Text.DecoderExceptionFallback()
        ),
        System.Text.Encoding.UTF8.GetBytes(inputString)
    )
);

System.Console.WriteLine(asAscii);
AND for those that just want to remove the accents:
(caution, because Normalize != Latinize != Romanize)
// string str = Latinize("(æøå âôû?aè");
public static string Latinize(string stIn)
{
    // Special treatment for German Umlauts
    stIn = stIn.Replace("ä", "ae");
    stIn = stIn.Replace("ö", "oe");
    stIn = stIn.Replace("ü", "ue");
    stIn = stIn.Replace("Ä", "Ae");
    stIn = stIn.Replace("Ö", "Oe");
    stIn = stIn.Replace("Ü", "Ue");
    // End special treatment for German Umlauts

    string stFormD = stIn.Normalize(System.Text.NormalizationForm.FormD);
    System.Text.StringBuilder sb = new System.Text.StringBuilder();

    for (int ich = 0; ich < stFormD.Length; ich++)
    {
        System.Globalization.UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[ich]);
        } // End if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
    } // Next ich

    //return (sb.ToString().Normalize(System.Text.NormalizationForm.FormC));
    return (sb.ToString().Normalize(System.Text.NormalizationForm.FormKC));
} // End Function Latinize
