Remove non-ASCII characters from string

Question

I'm getting strange characters when pulling data from a website:

Â

How can I remove anything that isn't a non-extended ASCII character?

A more appropriate question can be found here: PHP - replace all non-alphanumeric chars for all languages supported

What do you mean when you say non-ascii, `Â` is an ascii character (#194) — Drew Galbraith, Jan 08 '12 at 22:30
oh. well, I mean things like letters and characters such as $(#*@. I don't know how to explain it other than I only want characters you'd be able to type on your keyboard. — LordZardeck, Jan 08 '12 at 22:32
I can type "あいうえお" on *my* keyboard... Maybe you just have an *encoding problem* and should interpret the text in the right encoding instead of removing things? — deceze, Jan 08 '12 at 22:52
as an added note, you can run into this on some data as a pair with 194 followed by 160 which is the result of a cut/paste and unicode mangling of the HTML — Scott, Jun 01 '16 at 15:29
Â is a ***signature*** start of a ***[UTF-8](https://en.wikipedia.org/wiki/UTF-8) sequence*** (0xC2, octal 302, decimal 194). Another is (0xE2, octal 342, decimal 226). See e.g. code page [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout) or [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout). — Peter Mortensen, May 03 '23 at 14:38
For example, 342 200 234 (octal) → 0xE2 0x80 0x9C (hexadecimal) → UTF-8 sequence for Unicode code point U+201C ([LEFT DOUBLE QUOTATION MARK](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128)). Most are three bytes, but there is also the very common 302 240 (octal) → 0xC2 0xA0 (hexadecimal) → UTF-8 sequence for Unicode code point U+00A0 ([NO-BREAK SPACE](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=156&number=128)). — Peter Mortensen, May 03 '23 at 14:38
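The byte sequences described in these comments can be reproduced directly. The sketch below assumes the page was UTF-8 but its bytes were misread as ISO 8859-1 somewhere along the way; the variable names are illustrative only. It shows where the stray Â comes from, and that replacing the no-break space (rather than stripping bytes) may already be enough.

```php
<?php
// U+00A0 (NO-BREAK SPACE) encoded as UTF-8 is the two bytes 0xC2 0xA0.
$nbsp = "\xC2\xA0";

// Assumption: the bytes were misread as ISO 8859-1 and then re-encoded as UTF-8.
// 0xC2 becomes U+00C2 (Â) and 0xA0 becomes U+00A0, giving four bytes in total.
$mojibake = iconv('ISO-8859-1', 'UTF-8', $nbsp);
echo bin2hex($mojibake), "\n"; // c382c2a0

// If the text is correctly decoded UTF-8, the no-break space can be swapped for a plain space.
$clean = str_replace($nbsp, ' ', "price:\xC2\xA0100");
echo $clean, "\n"; // price: 100
```

If the output of the first step matches what you see in your data, fixing the decoding step is likely a better cure than deleting characters.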

Chris Bornhoft · Accepted Answer · 2018-03-05T15:06:35.130

124

A regex replace would be the best option. Using $str as an example string and matching it against [:print:], which is a POSIX character class:

$str = 'aAÂ';
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

[:print:] matches all printable characters. The negated form, [:^print:], matches all non-printable characters, so any character that is not printable in the current character set is removed.

Note: Before using this method, you must ensure that your current character set is ASCII. POSIX character classes support both ASCII and Unicode and match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
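As a commenter on this answer points out, [:^ascii:] avoids depending on what counts as printable. A minimal sketch, assuming the input is a byte string and the pattern is used without the /u modifier so that matching is byte-wise; the function name strip_non_ascii is made up for illustration:

```php
<?php
function strip_non_ascii($str)
{
    // Without /u the pattern is applied byte by byte;
    // [:^ascii:] matches every byte above 0x7F.
    return preg_replace('/[[:^ascii:]]/', '', $str);
}

echo strip_non_ascii("aA\xC3\x82"), "\n"; // aA ("\xC3\x82" is Â in UTF-8)
```

Note that ASCII control characters (0x00-0x1F and 0x7F) are still ASCII, so this variant keeps them.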

answered Jan 08 '12 at 22:34 · edited Mar 05 '18 at 15:06 · Chris Bornhoft

This solution is not working for me. :( I am getting aAÂ. php 5.3.0. (windows) – DamirR Jan 08 '12 at 23:12
this solution is dependent on the localisation of the perl regex library... in particular it seems to require a broken version – Jasen Aug 12 '14 at 00:03
@Jasen They're known as [POSIX Character Classes](http://www.regular-expressions.info/posixbrackets.html). They work with any version, but require ASCII to be the selected character set within PHP, since Character Classes also support Unicode fully. I've updated my answer accordingly. – Chris Bornhoft Aug 12 '14 at 16:16
How do you make ASCII the selected character set via code? – vcardillo Oct 17 '14 at 19:29
This is a solution for PHP string variable and not for PHP array variable. **What is the solution for PHP array variable containing these htmlentitycodes Â = `Â` which is a-circumflex?** – Neocortex Dec 03 '14 at 07:14
@BannedfromSO Take a look at the [`array_map`](http://php.net/manual/en/function.array-map.php) function. – Chris Bornhoft Dec 03 '14 at 16:59
@ChrisBornhoft - Yes I did this `$a = array_map('trim',$array);` – Neocortex Dec 04 '14 at 03:50
Any ideas why this allows any [UTF8 character](https://apps.timwhitlock.info/emoji/tables/unicode) even when PHP has been setup to use Windows-1252 with `ini_set('default_charset', 'windows-1252');`? I want to get rid of all those Unicode characters and allow only characters from the [Windows-1252 codepage](http://www.kostis.net/charsets/cp1252.htm). – andreszs Feb 07 '18 at 01:59
if you use `[:print:]` some characters may be changed to `?`, see here for more info on a workaround: https://alvinalexander.com/php/how-to-remove-non-printable-characters-in-string-regex – degenerate May 17 '18 at 15:34
yes, this answer only works on misconfigured systems. 'Â' is clearly a printing character (it is both inked and consumes space). Use `'/[[:^ascii:]]/'` instead of `'/[[:^print:]]/'` to strip non-ASCII. – Jasen Sep 08 '19 at 22:18
Jasen, your correction was the right solution for me at least. – Hobbes Dec 15 '20 at 04:26
@Jasen your answer is the correct one. Thanks – Plugie May 07 '21 at 09:26
Didn't work as it made for example from `Anton Dovečer` -> `Anton Doveer` but I'd expect it to do it to `Boris Dovecer` – Kaspar L. Palgi Oct 23 '21 at 18:38
@KasparL.Palgi that is *exactly* what the original question asked to accomplish: remove the characters completely. To replace them with a non-accented character, you would first need to create a custom mapping of the characters you'd like to replace. – Chris Bornhoft Oct 24 '21 at 19:22
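For the array case raised in the comments above, one possible sketch combines array_map with a byte-range pattern. The function name strip_non_ascii_all is hypothetical, and the sketch assumes a flat array of strings:

```php
<?php
function strip_non_ascii_all(array $values)
{
    // array_map() calls the callback once per element and returns a new array.
    return array_map(function ($value) {
        // Remove every byte outside printable ASCII (0x20-0x7E).
        return preg_replace('/[^\x20-\x7E]/', '', $value);
    }, $values);
}

print_r(strip_non_ascii_all(["caf\xC3\xA9", "na\xC3\xAFve"])); // elements become "caf" and "nave"
```

preg_replace() also accepts an array as its subject, so for a flat array of strings the wrapper is optional.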

score 52 · Answer 2 · edited May 03 '23 at 14:46

Do you want only ASCII printable characters?

Use this:

<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/', '', $str);
echo "($str)($res)";

Or even better, convert your input to UTF-8 and use phputf8 lib to translate 'not normal' characters into their ASCII representation:

require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

if(!utf8_is_valid($str))
{
  $str = utf8_bad_strip($str);
}

$str = utf8_to_ascii($str, '');
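The byte-range pattern above also removes tabs and line breaks, because they fall below 0x20. If those should survive, a variant along these lines may help; the function name is made up for illustration and the choice of which control characters to keep is an assumption:

```php
<?php
function keep_printable_ascii_and_whitespace($str)
{
    // Keep tab (0x09), line feed (0x0A), carriage return (0x0D)
    // and printable ASCII (0x20-0x7E); remove every other byte.
    return preg_replace('/[^\x09\x0A\x0D\x20-\x7E]/', '', $str);
}

// "\xC5\xA1" is š and "\xC4\x91" is đ in UTF-8; "\x00" is a NUL byte.
echo keep_printable_ascii_and_whitespace("ab\tqwre\xC5\xA1\xC4\x91sff\x00"), "\n";
```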

answered Jan 08 '12 at 22:51 by DamirR · edited May 03 '23 at 14:46 by Peter Mortensen

I also wanted to keep the tab character, so I used this regular expression: [^\x00-\x7E] – John Langford May 25 '17 at 16:51
Thank you! So much better than the accepted answer! over 10 years later, this saved me a lot of grief! – user6096790 Nov 05 '22 at 09:38

score 38 · Answer 3 · edited May 03 '23 at 15:05

Use:

$clearstring = filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);

Note that FILTER_SANITIZE_STRING is deprecated since PHP 8.1.
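Since FILTER_SANITIZE_STRING is deprecated, one possible replacement is FILTER_UNSAFE_RAW with the same flag. This is a sketch, not a drop-in equivalent: unlike FILTER_SANITIZE_STRING it does not strip tags. The function name strip_high_bytes is hypothetical.

```php
<?php
function strip_high_bytes($raw)
{
    // FILTER_UNSAFE_RAW leaves the string alone apart from what the flags request;
    // FILTER_FLAG_STRIP_HIGH removes bytes with a value above 127.
    return filter_var($raw, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
}

echo strip_high_bytes("<b>aA\xC3\x82</b>"), "\n"; // <b>aA</b>
```

Adding FILTER_FLAG_STRIP_LOW to the flags (combined with |) would additionally remove bytes below 32.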

answered Aug 24 '15 at 08:46 by Utopia · edited May 03 '23 at 15:05 by Peter Mortensen

Seems perfect for PHP >= 5.2 – user414873 Oct 22 '15 at 13:33
This seems to also strip tags. For me it was removing <%AnyTextHere%> See [PHP Sanitize filters](http://php.net/manual/en/filter.filters.sanitize.php) – ds00424 Sep 03 '16 at 16:41
Heads up: if you [go to functions-online.com to test this](https://ru.functions-online.com/filter_var.html?command={%22variable%22:%22\uf8ff%22,%22filter%22:%22FILTER_SANITIZE_STRING%22,%22options%22:%22FILTER_FLAG_STRIP_HIGH%22}), it will put single quotes around `FILTER_FLAG_STRIP_HIGH` which stops it from working – ᴍᴇʜᴏᴠ Feb 03 '20 at 12:26
This was helpful. Though I used FILTER_FLAG_ENCODE_HIGH instead of FILTER_FLAG_STRIP_HIGH – bhar1red Apr 11 '22 at 21:38
`FILTER_SANITIZE_STRING` is deprecated since PHP 8.1 – Oleg Jan 08 '23 at 17:55

score 26 · Answer 4 · edited May 03 '23 at 14:48

Kind of related: We had a web application that had to send data to a legacy system that could only deal with the first 128 characters of the ASCII character set.

The solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave anything that could not be translated alone.

Normally I would do something like this:

<?php
// transliterate
if (function_exists('iconv')) {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
?>

... but that replaces everything that can't be translated into a question mark (?).

So we ended up doing the following. See the end of this function for the (commented-out) PHP regex that just strips out non-ASCII characters.

<?php
public function cleanNonAsciiCharactersInString($orig_text) {

    $text = $orig_text;

    // Single letters
    $text = preg_replace("/[∂άαáàâãªä]/u",      "a", $text);
    $text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u",     "A", $text);
    $text = preg_replace("/[ЂЪЬБъь]/u",           "b", $text);
    $text = preg_replace("/[βвВ]/u",            "B", $text);
    $text = preg_replace("/[çς©с]/u",            "c", $text);
    $text = preg_replace("/[ÇС]/u",              "C", $text);
    $text = preg_replace("/[δ]/u",             "d", $text);
    $text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
    $text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u",     "E", $text);
    $text = preg_replace("/[₣]/u",               "F", $text);
    $text = preg_replace("/[НнЊњ]/u",           "H", $text);
    $text = preg_replace("/[ђћЋ]/u",            "h", $text);
    $text = preg_replace("/[ÍÌÎÏ]/u",           "I", $text);
    $text = preg_replace("/[íìîïιίϊі]/u",       "i", $text);
    $text = preg_replace("/[Јј]/u",             "j", $text);
    $text = preg_replace("/[ΚЌК]/u",            'K', $text);
    $text = preg_replace("/[ќк]/u",             'k', $text);
    $text = preg_replace("/[ℓ∟]/u",             'l', $text);
    $text = preg_replace("/[Мм]/u",             "M", $text);
    $text = preg_replace("/[ñηήηπⁿ]/u",            "n", $text);
    $text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u",       "N", $text);
    $text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
    $text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u",     "O", $text);
    $text = preg_replace("/[ρφрРф]/u",          "p", $text);
    $text = preg_replace("/[®яЯ]/u",              "R", $text);
    $text = preg_replace("/[ГЃгѓ]/u",              "r", $text);
    $text = preg_replace("/[Ѕ]/u",              "S", $text);
    $text = preg_replace("/[ѕ]/u",              "s", $text);
    $text = preg_replace("/[Тт]/u",              "T", $text);
    $text = preg_replace("/[τ†‡]/u",              "t", $text);
    $text = preg_replace("/[úùûüџμΰµυϋύ]/u",     "u", $text);
    $text = preg_replace("/[√]/u",               "v", $text);
    $text = preg_replace("/[ÚÙÛÜЏЦц]/u",         "U", $text);
    $text = preg_replace("/[Ψψωώẅẃẁщш]/u",      "w", $text);
    $text = preg_replace("/[ẀẄẂШЩ]/u",          "W", $text);
    $text = preg_replace("/[ΧχЖХж]/u",          "x", $text);
    $text = preg_replace("/[ỲΫ¥]/u",           "Y", $text);
    $text = preg_replace("/[ỳγўЎУуч]/u",       "y", $text);
    $text = preg_replace("/[ζ]/u",              "Z", $text);

    // Punctuation
    $text = preg_replace("/[‚‚]/u", ",", $text);
    $text = preg_replace("/[`‛′’‘]/u", "'", $text);
    $text = preg_replace("/[″“”«»„]/u", '"', $text);
    $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
    $text = preg_replace("/[  ]/u", ' ', $text);

    $text = str_replace("…", "...", $text);
    $text = str_replace("≠", "!=", $text);
    $text = str_replace("≤", "<=", $text);
    $text = str_replace("≥", ">=", $text);
    $text = preg_replace("/[‗≈≡]/u", "=", $text);


    // Exciting combinations
    $text = str_replace("ыЫ", "bl", $text);
    $text = str_replace("℅", "c/o", $text);
    $text = str_replace("₧", "Pts", $text);
    $text = str_replace("™", "tm", $text);
    $text = str_replace("№", "No", $text);
    $text = str_replace("Ч", "4", $text);
    $text = str_replace("‰", "%", $text);
    $text = preg_replace("/[∙•]/u", "*", $text);
    $text = str_replace("‹", "<", $text);
    $text = str_replace("›", ">", $text);
    $text = str_replace("‼", "!!", $text);
    $text = str_replace("⁄", "/", $text);
    $text = str_replace("∕", "/", $text);
    $text = str_replace("⅞", "7/8", $text);
    $text = str_replace("⅝", "5/8", $text);
    $text = str_replace("⅜", "3/8", $text);
    $text = str_replace("⅛", "1/8", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[Љљ]/u", "Ab", $text);
    $text = preg_replace("/[Юю]/u", "IO", $text);
    $text = preg_replace("/[ﬁﬂ]/u", "fi", $text);
    $text = preg_replace("/[зЗ]/u", "3", $text);
    $text = str_replace("£", "(pounds)", $text);
    $text = str_replace("₤", "(lira)", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
    $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);


    //2) Translation CP1252.
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans['f'] = '&fnof;';    // Latin Small Letter F With Hook
    $trans['-'] = array(
        '&hellip;',     // Horizontal Ellipsis
        '&tilde;',      // Small Tilde
        '&ndash;'       // Dash
        );
    $trans["+"] = '&dagger;';    // Dagger
    $trans['#'] = '&Dagger;';    // Double Dagger
    $trans['M'] = '&permil;';    // Per Mille Sign
    $trans['S'] = '&Scaron;';    // Latin Capital Letter S With Caron
    $trans['OE'] = '&OElig;';    // Latin Capital Ligature OE
    $trans["'"] = array(
        '&lsquo;',  // Left Single Quotation Mark
        '&rsquo;',  // Right Single Quotation Mark
        '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
        '&sbquo;',  // Single Low-9 Quotation Mark
        '&circ;',   // Modifier Letter Circumflex Accent
        '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
        );

    $trans['"'] = array(
        '&ldquo;',  // Left Double Quotation Mark
        '&rdquo;',  // Right Double Quotation Mark
        '&bdquo;',  // Double Low-9 Quotation Mark
        );

    $trans['*'] = '&bull;';    // Bullet
    $trans['n'] = '&ndash;';    // En Dash
    $trans['m'] = '&mdash;';    // Em Dash
    $trans['tm'] = '&trade;';    // Trade Mark Sign
    $trans['s'] = '&scaron;';    // Latin Small Letter S With Caron
    $trans['oe'] = '&oelig;';    // Latin Small Ligature OE
    $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    $trans['euro'] = '&euro;';    // euro currency symbol
    ksort($trans);

    foreach ($trans as $k => $v) {
        $text = str_replace($v, $k, $text);
    }

    // 3) remove <p>, <br/> ...
    $text = strip_tags($text);

    // 4) &amp; => & &quot; => "
    $text = html_entity_decode($text);


    // transliterate
    // if (function_exists('iconv')) {
    // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    // }

    // remove non ascii characters
    // $text =  preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);

    return $text;
}

?>
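The long chain of preg_replace() calls above can also be expressed as a single lookup table. The following is only a sketch of that idea, not the code we used: the map is a hypothetical, deliberately tiny subset, the function name is made up, and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical, deliberately small mapping; extend it with the characters your data contains.
$ascii_map = [
    "\u{00E9}" => 'e',    // é
    "\u{201C}" => '"',    // left double quotation mark
    "\u{201D}" => '"',    // right double quotation mark
    "\u{2026}" => '...',  // horizontal ellipsis
    "\u{00A0}" => ' ',    // no-break space
];

function transliterate_with_map($text, array $map)
{
    // strtr() with an array replaces every key in one pass and never rescans
    // text it has already replaced; unmapped characters are left alone.
    return strtr($text, $map);
}

// Ж (U+0416) is not in the map, so it passes through untouched.
echo transliterate_with_map("\u{201C}caf\u{00E9}\u{201D}\u{2026} \u{0416}", $ascii_map), "\n";
```

A single table keeps each character's replacement in one place and avoids running dozens of regular expressions over the same text.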

According to http://php.net/manual/en/function.iconv.php#74101 , that should only be an issue if you do not select a proper locale (other than C or POSIX) — MauganRa, Dec 16 '14 at 09:39
Re *"the first 128 characters of the ASCII character set"*: [ASCII](https://en.wikipedia.org/wiki/ASCII) only has 128: *"ASCII has just 128 code points"*. The last bit is used for extensions, like code page [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) or [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1). — Peter Mortensen, May 03 '23 at 14:51

simhumileco · Answer 5 · 2018-10-17T12:57:12.983

I also think that the best solution might be to use a regular expression.

Here's my suggestion:

function convert_to_normal_text($text) {

    $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
    $normal_text = preg_replace("/[^$normal_characters]/", '', $text);

    return $normal_text;
}

Then you can use it like this:

$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
$after = convert_to_normal_text($before);
echo $after;

Displays:

Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .

FYI, Typo on line 4: '$normal_caracters' => '$normal_characters' — ds00424, Sep 03 '16 at 16:51

score 1 · Answer 6 · edited Sep 10 '13 at 16:40

I just had to add the header

header('Content-Type: text/html; charset=UTF-8');

answered Sep 10 '13 at 16:24 by ALHaines · edited Sep 10 '13 at 16:40 by nhahtdh

that will fix the case where UTF8 is being interpreted as WIN-1252 which is the default encoding for HTML, however it will not remove any characters from a string. – Jasen Aug 12 '14 at 00:10
They probably don't have control over the website: *"I'm getting strange characters when pulling data from a website:"* – Peter Mortensen May 03 '23 at 14:47

score 0 · Answer 7 · edited May 03 '23 at 14:58

This should be pretty straightforward and there isn't any need for an iconv function:

// Remove all characters that are not the separator, a-z, 0-9, underscore, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-9_\s]+!', '', strtolower($string));

// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);

answered Mar 13 '15 at 07:30 by Goran Jakovljevic · edited May 03 '23 at 14:58 by Peter Mortensen

score 0 · Answer 8 · answered Sep 12 '22 at 08:42

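My problem was solved by removing the ASCII control characters (0x00-0x1F and 0x7F); everything else, including the accented letters, is kept: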

$text = 'Châu Thái  Nhân 12/09/2022';
echo preg_replace('/[\x00-\x1F\x7F]/', '', $text);
//Châu Thái  Nhân 12/09/2022

answered by Nhan Chau KP

What is the result? What does it do? Completely wipes out the characters? Removes non-printable characters? Please explain your solution. From [the Help Center](https://stackoverflow.com/help/promotion): *"...always explain why the solution you're presenting is appropriate and how it works"*. Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/73686605/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen May 03 '23 at 15:06

score -1 · Answer 9 · edited May 03 '23 at 15:01

I think the best way to do something like this is to use the ord() function. This way you can keep characters written in any language. Just remember to test your text's ord() results first. This will not work on Unicode text such as UTF-8, because ord() looks at a single byte.

$name = "βγδεζηΘKgfgebhjrf!@#$%^&";
// This function removes all non-Greek and non-English characters from a string in a Greek ISO charset
function replace_characters($string)
{
    $new_string = ''; // initialise the result so the concatenation below never hits an undefined variable
    $str_length = strlen($string);
    for ($x = 0; $x < $str_length; $x++)
    {
        $character = $string[$x];
        $code = ord($character); // compute the byte value once and reuse it
        if (($code >  64 && $code <   91) ||
            ($code >  96 && $code <  123) ||
            ($code > 192 && $code <  210) ||
            ($code > 210 && $code <  218) ||
            ($code > 219 && $code <  250) ||
             $code === 252 || $code === 254)
        {
            $new_string .= $character;
        }
    }
    return $new_string;
}
// End function

$name = replace_characters($name);

echo $name;
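For UTF-8 input, where the byte-wise ord() approach does not apply, a Unicode-aware pattern is one possible alternative. This is a sketch under the assumption that the string is valid UTF-8 and that PCRE was built with Unicode property support (the usual case for PHP); the function name is made up for illustration.

```php
<?php
function keep_greek_and_english_letters($string)
{
    // /u makes PCRE treat pattern and subject as UTF-8;
    // \p{Greek} matches characters in the Greek script.
    return preg_replace('/[^\p{Greek}A-Za-z]/u', '', $string);
}

echo keep_greek_and_english_letters('βγδεζηΘKgfgebhjrf!@#$%^&'), "\n"; // βγδεζηΘKgfgebhjrf
```

With the /u modifier, preg_replace() returns null when the subject is not valid UTF-8, so the result is worth checking before use.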

You're doing ord() on the same character over and over again just for different comparisons (line 9). That's extremely inefficient. You should save result of ord() in variable and then reuse it in conditional. Also, consider using === instead of == as use of == is discouraged. Although I don't blame you for this, ironically PHP manual for ord() shows using == in examples. — xZero, Nov 25 '17 at 13:50
