How can you strip non-ASCII characters from a string? (in C#)
17 Answers
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
The ^ is the negation operator: it tells the regex to match everything that is not in the character class, instead of everything that is. The \u####-\u#### syntax specifies a range of characters. \u0000-\u007F covers the first 128 Unicode code points, which are exactly the ASCII characters. So the pattern matches every non-ASCII character (because of the negation) and replaces each match with an empty string.
(as explained in a comment by Gordon Tucker Dec 11, 2009 at 21:11)
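For completeness, here is a compilable sketch of the same call, together with the printable-only variant mentioned in the comments below (variable names are mine):
using System.Text.RegularExpressions;

string s = "søme string";

// Strip everything outside the ASCII range (U+0000–U+007F).
string asciiOnly = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

// Variant: keep only printable ASCII (U+0020–U+007E), dropping control characters as well.
string printableOnly = Regex.Replace(s, @"[^\u0020-\u007E]+", string.Empty);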

- Range for printable characters is 0020-007E, for people looking for a regular expression to replace non-printable characters – Mubashar Feb 17 '14 at 04:40
- If you wish to see a table of the ASCII character set: http://www.asciitable.com/ – scradam Feb 26 '15 at 15:06
- Range for **extended ASCII** is \u0000-\u00FF, for people looking for a regular expression to replace non-extended-ASCII characters (i.e. for apps with Spanish language, diacritics etc...) – full_prog_full Dec 29 '15 at 21:30
- @GordonTucker \u0000-\u007F is the equivalent of the **first 127 characters** in utf-8 or unicode and NOT the first 255. See [table](http://www.ascii-code.com/) – full_prog_full Dec 29 '15 at 21:33
- @full_prog_full Which is why I replied to myself about a minute later correcting myself to say it was 127 and not 255. :) – Gordon Tucker Dec 30 '15 at 21:46
- But 0000-0010 also contains non-ASCII characters like NUL, SOH, STX etc – Haseeb Mir May 03 '20 at 02:50
Here is a pure .NET solution that doesn't use regular expressions:
string inputString = "Räksmörgås";
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(inputString)
    )
);
It may look cumbersome, but it should be intuitive. It uses the .NET ASCII encoding to convert the string. UTF-8 is used as the intermediate encoding because it can represent any of the original characters, and an EncoderReplacementFallback converts every non-ASCII character to an empty string.
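If this runs in a hot path, the encoding and its fallbacks can be cached instead of being rebuilt on every call, and since GetBytes already applies the fallback, the round-trip through Encoding.Convert can be skipped (see also the performance-focused answer further down). A minimal sketch; the class and method names are mine:
using System.Text;

public static class AsciiStripper
{
    // Cache the "replace non-ASCII with nothing" encoding so the fallback
    // objects are created only once (names are illustrative, not from the answer).
    private static readonly Encoding AsciiDropping = Encoding.GetEncoding(
        Encoding.ASCII.EncodingName,
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    public static string ToAsciiOnly(string input)
    {
        // GetBytes drops every character the encoder cannot represent;
        // GetString turns the remaining ASCII bytes back into a string.
        return AsciiDropping.GetString(AsciiDropping.GetBytes(input));
    }
}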
- Perfect! I'm using this to clean a string before saving it to a RTF document. Very much appreciated. Much easier to understand than the Regex version. – Nathan Prather Oct 06 '09 at 16:48
- You really find it easier to understand? To me, all the stuff that's not really relevant (fallbacks, conversions to bytes etc) is drawing the attention away from what actually happens. – bzlm Oct 11 '09 at 15:28
- @Brandon, actually, this technique doesn't do the job better than other techniques. So the analogy would be using a plain olde screwdriver instead of a fancy iScrewDriver Deluxe 2000. :) – bzlm Aug 04 '11 at 07:46
- @bzlm It's like using a hammer on a screw :) OK not. So it's like using the crankshaft of your car engine to drive a screw. There we go. – Brandon Aug 22 '11 at 17:10
- @InsidiousForce, probably depends on which regular expression you use. Why don't you take one of the expressions from one of the answers to this question and benchmark it? :) – bzlm May 27 '13 at 08:36
- One advantage is that I can easily replace ASCII with ISO 8859-1 or another encoding :) – Akira Yamamoto Jul 04 '13 at 03:34
- We have a FoxPro DB that our system uses, which gets corrupted as a pastime. Since this function is run on almost every field of every row, I was curious to know the performance difference and whether there was anything better than plain regex. For 1,000 randomly generated Unicode strings the run times are `Regexp: Avg: 3~4ms, Max: 4ms` and `Encoding Conversion: Avg: 4~5ms, Max: 7ms` (not including string generation, which is outside the timer) – syserr0r Jul 16 '13 at 11:46
- @syserr0r Interesting. This technique could probably be optimized, depending on what's taking time. The 2 Fallback instances could be re-used, for example. – bzlm Aug 05 '13 at 12:20
- I'm finding this to be faster than the regex on smaller strings (they are nearly even on a 1000 character string) and slower on larger strings – msmucker0527 Oct 02 '14 at 15:17
- Wondering if I could use this somehow to replace non-ASCII characters with a replacement character, for example: `á` would be replaced with `a`. Is this possible? – CularBytes Dec 28 '15 at 12:42
- @RageCompex The EncoderReplacementFallback wasn't designed for conversion. But what you want can be achieved using the .NET APIs for Unicode Normalization and Canonicalization. – bzlm Dec 30 '15 at 11:19
I believe MonsCamus meant:
parsememo = Regex.Replace(parsememo, @"[^\u0020-\u007E]", string.Empty);
- IMHO This answer is better than the accepted answer because it strips out control characters. – Dean2690 Sep 25 '17 at 14:30
If you don't want to strip, but to actually convert Latin accented characters to non-accented ones, take a look at this question: How do I translate 8bit characters into 7bit characters? (i.e. Ü to U)
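A minimal sketch of the decompose-and-strip approach covered there (the method name is mine; the same idea appears in the Latinize function near the end of this page):
// Normalize to FormD, drop the combining (non-spacing) marks, then recompose.
static string RemoveDiacritics(string text)
{
    var sb = new System.Text.StringBuilder();
    foreach (char c in text.Normalize(System.Text.NormalizationForm.FormD))
    {
        var category = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c);
        if (category != System.Globalization.UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    }
    return sb.ToString().Normalize(System.Text.NormalizationForm.FormC);
}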
- I didn't even realize this was possible, but it's a much better solution for me. I'm going to add this link to a comment on the question to make it easier for other people to find. Thanks! – Bobson Dec 10 '13 at 15:36
Inspired by philcruz's Regular Expression solution, I've made a pure LINQ solution:
public static string PureAscii(this string source, char nil = ' ')
{
    var min = '\u0000';
    var max = '\u007F';
    return source.Select(c => c < min ? nil : c > max ? nil : c).ToText();
}

public static string ToText(this IEnumerable<char> source)
{
    var buffer = new StringBuilder();
    foreach (var c in source)
        buffer.Append(c);
    return buffer.ToString();
}
This is untested code.
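A usage sketch, assuming the extension methods above are in scope; note that with the default nil, non-ASCII characters become spaces rather than disappearing:
var clean = "søme string".PureAscii(); // "s me string" – the ø is replaced by a space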

- Instead of the separate ToText() method, how about replacing line 3 of PureAscii() with: return new string(source.Select(c => c < min ? nil : c > max ? nil : c).ToArray()); – agentnega Nov 10 '11 at 05:51
- Or perhaps ToText as: return (new string(source)).ToArray() - depending on what performs best. It's still nice to have ToText as an extension method - fluent/pipeline style. :-) – Bent Rasmussen Jan 15 '16 at 10:14
- That code replaces non-ASCII characters with a space. To strip them out, change Select to Where: `return new string( source.Where( c => c >= min && c <= max ).ToArray() );` – Foozinator May 17 '17 at 20:53
- @Foozinator That code allows you to specify which character to replace the non-ASCII characters with. By default it uses a space, but if it's called like .PureASCII(Char.MinValue), it will replace all non-ASCII with '\0' - which still isn't exactly stripping them, but similar results. – Ulfius Nov 29 '17 at 16:42
- The ToText method can be removed, and line 5 can be replaced by: `return source.Where(c => c >= min && c <= max).Aggregate(new StringBuilder(), (sb, s) => sb.Append(s), sb => sb.ToString());` – Joakim M. H. Aug 13 '19 at 06:33
I found the following slightly altered range useful for parsing comment blocks out of a database. It means you won't have to contend with tab and escape characters, which would cause a CSV field to become upset.
parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]", string.Empty);
If you want to avoid other special characters or particular punctuation, check the ASCII table.

- In case anyone hasn't noticed the other comments, the printable characters are actually @"[^\u0020-\u007E]". Here's a link to see the table if you're curious: http://www.asciitable.com/ – scradam Feb 26 '15 at 15:03
No need for regex, just use encoding...
sOutput = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(sInput));
- This does not work. This does not strip unicode characters, it replaces them with the ? character. – David Feb 27 '14 at 16:56
- @David is right. At least I got `????nacho??` when I tried: `たまねこnachoなち` in mono 3.4 – nacho4d Aug 06 '14 at 02:38
- You can instantiate your own Encoding class that removes characters instead of replacing them. See the GetEncoding method: https://msdn.microsoft.com/en-us/library/89856k4b(v=vs.110).aspx – kkara Apr 01 '16 at 13:52
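A sketch of what that last comment describes, reusing the variable names from the answer: give the ASCII encoding an encoder fallback that replaces unencodable characters with nothing, so they are dropped instead of becoming ?.
// Same one-liner idea, but the custom fallback removes non-ASCII characters
// instead of replacing them with '?'.
var asciiDrop = System.Text.Encoding.GetEncoding("us-ascii",
    new System.Text.EncoderReplacementFallback(string.Empty),
    new System.Text.DecoderReplacementFallback(string.Empty));
sOutput = asciiDrop.GetString(asciiDrop.GetBytes(sInput));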
I came here looking for a solution for extended ASCII characters, but couldn't find one. The closest I found is bzlm's solution, but that only works for ASCII codes up to 127 (obviously you can replace the encoding type in his code, but I think it was a bit complex to understand, hence sharing this version). Here's a solution that works for extended ASCII codes, i.e. up to 255, which is ISO 8859-1.
It finds and strips out characters beyond extended ASCII (code points greater than 255):
Dim str1 As String = "â, ??î or ôu� n☁i✑++$-♓!‼⁉4⃣od;/⏬'®;☕:☝)///1!@#"

Dim extendedAscii As Encoding = Encoding.GetEncoding("ISO-8859-1",
    New EncoderReplacementFallback(String.Empty),
    New DecoderReplacementFallback())

Dim extendedAsciiBytes() As Byte = extendedAscii.GetBytes(str1)
Dim str2 As String = extendedAscii.GetString(extendedAsciiBytes)
Console.WriteLine(str2)
'Output : â, ??î or ôu ni++$-!‼⁉4od;/';:)///1!@#$%^yz:
Here's a working fiddle for the code
Replace the encoding as per the requirement, rest should remain the same.

- The only one that worked to remove ONLY the Ω from this string "Ω c ç ã". Thank you very much! – Rafael Araújo May 08 '19 at 00:19
This is not optimal performance-wise, but a pretty straightforward LINQ approach:
string strippedString = new string(
    yourString.Where(c => c <= sbyte.MaxValue).ToArray()
);
The downside is that all the "surviving" characters are first put into an array of type char[], which is then thrown away once the string constructor no longer uses it.
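If the throwaway array bothers you, string.Concat has an overload that accepts the character sequence directly, so the intermediate array never needs a name (the characters are still buffered internally):
// Same filter, without naming the temporary array.
string strippedString = string.Concat(yourString.Where(c => c <= sbyte.MaxValue));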

I used this regular expression:
string s = "søme string";
Regex regex = new Regex(@"[^a-zA-Z0-9\s]", RegexOptions.None);
return regex.Replace(s, "");

- This removes punctuation as well, just in case that's not what someone wants. – Drew Noakes Jul 18 '12 at 08:43
I use this regular expression to filter out bad characters in a filename.
Regex.Replace(directory, @"[^a-zA-Z0-9\:_\- ]", "")
That should be all the characters allowed for filenames.

- 1,315
- 15
- 15
- Nope. See [Path.GetInvalidPathChars](https://msdn.microsoft.com/en-us/library/system.io.path.getinvalidpathchars(v=vs.110).aspx) and [Path.GetInvalidFileNameChars](https://msdn.microsoft.com/en-us/library/system.io.path.getinvalidfilenamechars(v=vs.110).aspx). So, there are tens of thousands of valid characters. – Tom Blodget Jun 10 '17 at 00:04
- You are correct, Tom. I was actually thinking of the common ones, but I left out parenthesis and curly braces as well as all these - ^%$#@!&+=. – user890332 Jun 12 '17 at 20:02
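Following Tom Blodget's comment, if the goal is a legal file name rather than ASCII-only text, a sketch that filters by the characters the runtime itself reports as invalid (fileName is a placeholder; requires using System.Linq):
// Drop only the characters the OS considers invalid in file names,
// instead of whitelisting ASCII letters and digits.
char[] invalid = System.IO.Path.GetInvalidFileNameChars();
string safeName = new string(fileName.Where(c => !invalid.Contains(c)).ToArray());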
public string ReturnCleanASCII(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (char c in s)
    {
        if ((int)c > 127) // you probably don't want 127 either
            continue;
        if ((int)c < 32) // I bet you don't want control characters
            continue;
        if (c == '%')
            continue;
        if (c == '?')
            continue;
        sb.Append(c);
    }
    return sb.ToString();
}

If you want a string containing only ISO-8859-1 characters, excluding the characters that are not standard, you should use this expression:
var result = Regex.Replace(value, @"[^\u0020-\u007E\u00A0-\u00FF]+", string.Empty);
Note: using the Encoding.GetEncoding("ISO-8859-1") method will not do the job, because the undefined characters are not excluded.
See the Wikipedia ISO-8859-1 code page for more details.

I did a bit of testing, and @bzlm's answer is the fastest valid answer.
But it turns out we can do much better.
The conversion using the encoding is equivalent to the following code when inlining Encoding.Convert:
public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    Encoding srcEncoding = Encoding.UTF8;
    return dstEncoding.GetString(dstEncoding.GetBytes(srcEncoding.GetChars(srcEncoding.GetBytes(unicode))));
}
As you can clearly see, we perform two redundant actions by re-encoding through UTF-8. Why is that, you may ask? C# stores strings exclusively as UTF-16 code units, which can of course also be represented in UTF-8, since the Unicode encodings are intercompatible. (Side note: @bzlm's solution transcodes through UTF-8, so malformed UTF-16 input may throw an exception during transcoding.) So the operation is independent of the source encoding, since the source is always UTF-16.
Let's get rid of the redundant re-encoding and prevent those edge-case failures.
public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    return dstEncoding.GetString(dstEncoding.GetBytes(unicode));
}
We already have a simplified and perfectly workable solution, which requires less than half as much time to compute.
There is not much more performance to be gained, but for further memory optimization we can do two things:
- Accept a ReadOnlySpan<char> for a more usable API.
- Attempt to fit the temporary byte[] onto the stack; otherwise use an array pool.
public static string StripUnicode(ReadOnlySpan<char> unicode) {
    return EnsureEncoding(unicode, GreedyAscii);
}

/// <summary>Produces a string which is compatible with the limiting encoding</summary>
/// <remarks>Ensure that the encoding does not throw on illegal characters</remarks>
public static string EnsureEncoding(ReadOnlySpan<char> unicode, Encoding limitEncoding) {
    int asciiBytesLength = limitEncoding.GetMaxByteCount(unicode.Length);
    byte[]? asciiBytes = asciiBytesLength <= 2048 ? null : ArrayPool<byte>.Shared.Rent(asciiBytesLength);
    Span<byte> asciiSpan = asciiBytes ?? stackalloc byte[asciiBytesLength];

    asciiBytesLength = limitEncoding.GetBytes(unicode, asciiSpan);
    asciiSpan = asciiSpan.Slice(0, asciiBytesLength);

    string asciiChars = limitEncoding.GetString(asciiSpan);

    if (asciiBytes is { }) {
        ArrayPool<byte>.Shared.Return(asciiBytes);
    }

    return asciiChars;
}

private static Encoding GreedyAscii { get; } = Encoding.GetEncoding(Encoding.ASCII.EncodingName, new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());
You can see this snippet in action on sharplab.io

Just decode the Unicode using Regex.Unescape(s)

- As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 17 '23 at 17:33
You can use Char.IsAscii to identify the characters you want to keep. A simple implementation might look like:
public static string StripNonAscii(this string input)
{
    StringBuilder resultBuilder = new();
    foreach (char character in input)
        if (char.IsAscii(character))
            resultBuilder.Append(character);
    return resultBuilder.ToString();
}
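For one-off use, the same check also fits in a single expression (char.IsAscii is available from .NET 6 onwards):
string stripped = string.Concat(input.Where(char.IsAscii));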

Necromancing.
Also, the method by bzlm can be used to remove characters that are not in an arbitrary charset, not just ASCII:
// https://en.wikipedia.org/wiki/Code_page#EBCDIC-based_code_pages
// https://en.wikipedia.org/wiki/Windows_code_page#East_Asian_multi-byte_code_pages
// https://en.wikipedia.org/wiki/Chinese_character_encoding
System.Text.Encoding encRemoveAllBut = System.Text.Encoding.ASCII;
encRemoveAllBut = System.Text.Encoding.GetEncoding(System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage); // System-encoding
encRemoveAllBut = System.Text.Encoding.GetEncoding(1252); // Western European (iso-8859-1)
encRemoveAllBut = System.Text.Encoding.GetEncoding(1251); // Windows-1251/KOI8-R
encRemoveAllBut = System.Text.Encoding.GetEncoding("ISO-8859-5"); // used by less than 0.1% of websites
encRemoveAllBut = System.Text.Encoding.GetEncoding(37); // IBM EBCDIC US-Canada
encRemoveAllBut = System.Text.Encoding.GetEncoding(500); // IBM EBCDIC Latin 1
encRemoveAllBut = System.Text.Encoding.GetEncoding(936); // Chinese Simplified
encRemoveAllBut = System.Text.Encoding.GetEncoding(950); // Chinese Traditional
encRemoveAllBut = System.Text.Encoding.ASCII; // putting ASCII again, as to answer the question
// https://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c
string inputString = "RäksmörПривет, мирgås";
string asAscii = encRemoveAllBut.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.UTF8,
        System.Text.Encoding.GetEncoding(
            encRemoveAllBut.CodePage,
            new System.Text.EncoderReplacementFallback(string.Empty),
            new System.Text.DecoderExceptionFallback()
        ),
        System.Text.Encoding.UTF8.GetBytes(inputString)
    )
);

System.Console.WriteLine(asAscii);
AND for those that just want to remove the accents:
(caution, because Normalize != Latinize != Romanize)
// string str = Latinize("(æøå âôû?aè");
public static string Latinize(string stIn)
{
    // Special treatment for German Umlauts
    stIn = stIn.Replace("ä", "ae");
    stIn = stIn.Replace("ö", "oe");
    stIn = stIn.Replace("ü", "ue");
    stIn = stIn.Replace("Ä", "Ae");
    stIn = stIn.Replace("Ö", "Oe");
    stIn = stIn.Replace("Ü", "Ue");
    // End special treatment for German Umlauts

    string stFormD = stIn.Normalize(System.Text.NormalizationForm.FormD);
    System.Text.StringBuilder sb = new System.Text.StringBuilder();

    for (int ich = 0; ich < stFormD.Length; ich++)
    {
        System.Globalization.UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[ich]);
        } // End if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
    } // Next ich

    //return (sb.ToString().Normalize(System.Text.NormalizationForm.FormC));
    return (sb.ToString().Normalize(System.Text.NormalizationForm.FormKC));
} // End Function Latinize
