74

I'm getting strange characters when pulling data from a website:

Â

How can I remove anything that isn't a non-extended ASCII character?


A more appropriate question can be found here: PHP - replace all non-alphanumeric chars for all languages supported

Peter Mortensen
LordZardeck
  • 1
    What do you mean when you say non-ascii, `Â` is an ascii character (#194) – Drew Galbraith Jan 08 '12 at 22:30
  • 1
    oh. well, I mean things like letters and characters such as $(#*@. I don't know how to explain it other than I only want characters you'd be able to type on your keyboard. – LordZardeck Jan 08 '12 at 22:32
  • Do you mean non-alphanumeric? – j08691 Jan 08 '12 at 22:34
  • 2
    Could you define what are normal characters? – Shiplu Mokaddim Jan 08 '12 at 22:34
  • 9
    I can type "あいうえお" on *my* keyboard... Maybe you just have an *encoding problem* and should interpret the text in the right encoding instead of removing things? – deceze Jan 08 '12 at 22:52
  • As an added note, you can run into this in some data as a pair: 194 followed by 160, which is the result of a cut/paste and Unicode mangling of the HTML `&nbsp;` entity. – Scott Jun 01 '16 at 15:29
  • 1
    @DrewGalbraith #194 is not ASCII, ASCII only goes to #127 – Jasen Sep 08 '19 at 22:34
  • Â is a ***signature*** start of a ***[UTF-8](https://en.wikipedia.org/wiki/UTF-8) sequence*** (0xC2, octal 302, decimal 194). Another is (0xE2, octal 342, decimal 226). See e.g. code page [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout) or [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout). – Peter Mortensen May 03 '23 at 14:38
  • For example, 342 200 234 (octal) → 0xE2 0x80 0x9C (hexadecimal) → UTF-8 sequence for Unicode code point U+201C ([LEFT DOUBLE QUOTATION MARK](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128)). Most are three bytes, but there is also the very common 302 240 (octal) → 0xC2 0xA0 (hexadecimal) → UTF-8 sequence for Unicode code point U+00A0 ([NO-BREAK SPACE](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=156&number=128)). – Peter Mortensen May 03 '23 at 14:38
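
The byte values mentioned in the comments above can be checked directly. The sketch below assumes PHP 7 or later (for the `\u{...}` escape) and a string that really is UTF-8; the sample input is invented for illustration. It shows that U+00A0 is stored as the two bytes 0xC2 0xA0, and that reading those bytes as ISO 8859-1 or Windows-1252 is what makes a stray Â appear in front of the space.

```php
<?php
// U+00A0 (NO-BREAK SPACE) written with a Unicode escape (PHP 7+).
$nbsp = "\u{00A0}";

// In UTF-8 it occupies two bytes: 0xC2 (decimal 194) and 0xA0 (decimal 160).
$hex    = bin2hex($nbsp);   // hexadecimal dump of the raw bytes
$first  = ord($nbsp[0]);    // the byte that is displayed as "Â" in Latin-1
$second = ord($nbsp[1]);    // the byte that is displayed as a no-break space

// Illustrative input: a distance with a no-break space before the unit.
$scraped = "100" . $nbsp . "km";

// If the goal is plain ASCII, swapping the no-break space for a regular
// space keeps the word boundary instead of deleting it.
$cleaned = str_replace($nbsp, ' ', $scraped);

echo $hex, ' ', $first, ' ', $second, ' ', $cleaned, "\n";
```

Replacing the sequence rather than stripping it is a design choice: deleting both bytes would glue "100" and "km" together.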

9 Answers

124

A regex replace would be the best option. Here $str is an example string, matched against :print:, which is a POSIX character class:

$str = 'aAÂ';
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

:print: matches every printable character. Its negation, :^print:, matches every non-printable character. Any character that is not printable in the current character set is therefore removed.

Note: Before using this method, you must ensure that your current character set is ASCII. POSIX Character Classes support both ASCII and Unicode and will match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
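
Because the POSIX classes depend on how PHP and PCRE are configured, an explicit byte range is a more predictable way to express the same intent. The following is a minimal sketch, not part of the original answer: the function names are made up for illustration, and it assumes the input is a byte string in which every non-ASCII character is encoded with bytes above 0x7F (as in UTF-8 or ISO 8859-1).

```php
<?php
// Hypothetical helper: drop every byte outside the 7-bit ASCII range.
// Tabs and newlines (which are ASCII control characters) are kept.
function strip_non_ascii(string $s): string
{
    return preg_replace('/[^\x00-\x7F]/', '', $s);
}

// Hypothetical helper: keep only printable ASCII (space through tilde),
// so control characters are removed as well.
function keep_printable_ascii(string $s): string
{
    return preg_replace('/[^\x20-\x7E]/', '', $s);
}

// "aA", then Â given as its UTF-8 bytes, then a tab, then "b".
$str = "aA\xC3\x82\tb";

echo strip_non_ascii($str), "\n";       // the two bytes of Â are removed
echo keep_printable_ascii($str), "\n";  // the tab is removed too
```

Neither pattern uses the /u modifier, so both work byte by byte and do not care whether the input is valid UTF-8.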

Chris Bornhoft
  • 4
    This solution is not working for me. :( I am getting aAÂ. php 5.3.0. (windows) – DamirR Jan 08 '12 at 23:12
  • This solution is dependent on the localisation of the Perl regex library... in particular, it seems to require a broken version. – Jasen Aug 12 '14 at 00:03
  • @Jasen They're known as [POSIX Character Classes](http://www.regular-expressions.info/posixbrackets.html). They work with any version, but require ASCII to be the selected character set within PHP, since Character Classes also support Unicode fully. I've updated my answer accordingly. – Chris Bornhoft Aug 12 '14 at 16:16
  • 1
    How do you make ASCII the selected character set via code? – vcardillo Oct 17 '14 at 19:29
  • This is a solution for PHP string variable and not for PHP array variable. **What is the solution for PHP array variable containing these htmlentitycodes  = `Â` which is a-circumflex?** – Neocortex Dec 03 '14 at 07:14
  • @BannedfromSO Take a look at the [`array_map`](http://php.net/manual/en/function.array-map.php) function. – Chris Bornhoft Dec 03 '14 at 16:59
  • @ChrisBornhoft - Yes I did this `$a = array_map('trim',$array);` – Neocortex Dec 04 '14 at 03:50
  • Any ideas why this allows any [UTF8 character](https://apps.timwhitlock.info/emoji/tables/unicode) even when PHP has been setup to use Windows-1252 with `ini_set('default_charset', 'windows-1252');`? I want to get rid of all those Unicode characters and allow only characters from the [Windows-1252 codepage](http://www.kostis.net/charsets/cp1252.htm). – andreszs Feb 07 '18 at 01:59
  • if you use `[:print:]` some characters may be changed to `?`, see here for more info on a workaround: https://alvinalexander.com/php/how-to-remove-non-printable-characters-in-string-regex – degenerate May 17 '18 at 15:34
  • 3
    Yes, this answer only works on misconfigured systems. 'Â' is clearly a printing character (it is both inked and consumes space). Use `'/[[:^ascii:]]/'` instead of `'/[[:^print:]]/'` to strip non-ASCII. – Jasen Sep 08 '19 at 22:18
  • 1
    Jasen, your correction was the right solution for me at least. – Hobbes Dec 15 '20 at 04:26
  • 1
    @Jasen your answer is the correct one. Thanks – Plugie May 07 '21 at 09:26
  • Didn't work as it made for example from `Anton Dovečer` -> `Anton Doveer` but I'd expect it to do it to `Boris Dovecer` – Kaspar L. Palgi Oct 23 '21 at 18:38
  • 1
    @KasparL.Palgi that is *exactly* what the original question asked to accomplish: remove the characters completely. To replace with an non-accented character, you would need to create a custom mapping of the characters you'd like to replace first. – Chris Bornhoft Oct 24 '21 at 19:22
52

Do you want only ASCII printable characters?

Use this:

<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/', '', $str);
echo "($str)($res)";

Or even better, convert your input to UTF-8 and use the phputf8 library to transliterate non-ASCII characters into their closest ASCII representation:

require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

if(!utf8_is_valid($str))
{
  $str = utf8_bad_strip($str);
}

$str = utf8_to_ascii($str, '');
Peter Mortensen
DamirR
38

Use:

$clearstring = filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);

Note that FILTER_SANITIZE_STRING is deprecated since PHP 8.1.
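
On PHP 8.1 and later, one way to keep the same high-byte stripping without the deprecated filter is to combine FILTER_UNSAFE_RAW with FILTER_FLAG_STRIP_HIGH. This is a sketch rather than a drop-in replacement: unlike FILTER_SANITIZE_STRING it does not strip tags or encode quotes, and the sample input is only illustrative.

```php
<?php
// "aAÂ <b>café</b>" with the non-ASCII characters given as UTF-8 bytes.
$rawstring = "aA\xC3\x82 <b>caf\xC3\xA9</b>";

// FILTER_UNSAFE_RAW leaves the string alone except for what the flags request;
// FILTER_FLAG_STRIP_HIGH removes every byte with a value above 127.
$clearstring = filter_var($rawstring, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

echo $clearstring, "\n";   // tags are left in place; only high bytes are gone
```

If tags also need to go, strip_tags() can be applied as a separate step.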

Peter Mortensen
Utopia
  • Seems perfect for PHP >= 5.2 – user414873 Oct 22 '15 at 13:33
  • This seems to also strip tags. For me it was removing <%AnyTextHere%> See [PHP Sanitize filters](http://php.net/manual/en/filter.filters.sanitize.php) – ds00424 Sep 03 '16 at 16:41
  • Heads up: if you [go to functions-online.com to test this](https://ru.functions-online.com/filter_var.html?command={%22variable%22:%22\uf8ff%22,%22filter%22:%22FILTER_SANITIZE_STRING%22,%22options%22:%22FILTER_FLAG_STRIP_HIGH%22}), it will put single quotes around `FILTER_FLAG_STRIP_HIGH` which stops it from working – ᴍᴇʜᴏᴠ Feb 03 '20 at 12:26
  • This was helpful. Though I used FILTER_FLAG_ENCODE_HIGH instead of FILTER_FLAG_STRIP_HIGH – bhar1red Apr 11 '22 at 21:38
  • 1
    `FILTER_SANITIZE_STRING` is deprecated since PHP 8.1 – Oleg Jan 08 '23 at 17:55
26

Kind of related: We had a web application that had to send data to a legacy system that could only deal with the first 128 characters of the ASCII character set.

The solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave anything that could not be translated alone.

Normally I would do something like this:

<?php
// transliterate
if (function_exists('iconv')) {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
?>

... but that replaces every character that cannot be transliterated with a question mark (?).

So we ended up doing the following. At the end of this function there is a (commented-out) PHP regex that simply strips out non-ASCII characters.

<?php
function cleanNonAsciiCharactersInString($orig_text) {

    $text = $orig_text;

    // Single letters
    $text = preg_replace("/[∂άαáàâãªä]/u",      "a", $text);
    $text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u",     "A", $text);
    $text = preg_replace("/[ЂЪЬБъь]/u",           "b", $text);
    $text = preg_replace("/[βвВ]/u",            "B", $text);
    $text = preg_replace("/[çς©с]/u",            "c", $text);
    $text = preg_replace("/[ÇС]/u",              "C", $text);
    $text = preg_replace("/[δ]/u",             "d", $text);
    $text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
    $text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u",     "E", $text);
    $text = preg_replace("/[₣]/u",               "F", $text);
    $text = preg_replace("/[НнЊњ]/u",           "H", $text);
    $text = preg_replace("/[ђћЋ]/u",            "h", $text);
    $text = preg_replace("/[ÍÌÎÏ]/u",           "I", $text);
    $text = preg_replace("/[íìîïιίϊі]/u",       "i", $text);
    $text = preg_replace("/[Јј]/u",             "j", $text);
    $text = preg_replace("/[ΚЌК]/u",            'K', $text);
    $text = preg_replace("/[ќк]/u",             'k', $text);
    $text = preg_replace("/[ℓ∟]/u",             'l', $text);
    $text = preg_replace("/[Мм]/u",             "M", $text);
    $text = preg_replace("/[ñηήηπⁿ]/u",            "n", $text);
    $text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u",       "N", $text);
    $text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
    $text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u",     "O", $text);
    $text = preg_replace("/[ρφрРф]/u",          "p", $text);
    $text = preg_replace("/[®яЯ]/u",              "R", $text);
    $text = preg_replace("/[ГЃгѓ]/u",              "r", $text);
    $text = preg_replace("/[Ѕ]/u",              "S", $text);
    $text = preg_replace("/[ѕ]/u",              "s", $text);
    $text = preg_replace("/[Тт]/u",              "T", $text);
    $text = preg_replace("/[τ†‡]/u",              "t", $text);
    $text = preg_replace("/[úùûüџμΰµυϋύ]/u",     "u", $text);
    $text = preg_replace("/[√]/u",               "v", $text);
    $text = preg_replace("/[ÚÙÛÜЏЦц]/u",         "U", $text);
    $text = preg_replace("/[Ψψωώẅẃẁщш]/u",      "w", $text);
    $text = preg_replace("/[ẀẄẂШЩ]/u",          "W", $text);
    $text = preg_replace("/[ΧχЖХж]/u",          "x", $text);
    $text = preg_replace("/[ỲΫ¥]/u",           "Y", $text);
    $text = preg_replace("/[ỳγўЎУуч]/u",       "y", $text);
    $text = preg_replace("/[ζ]/u",              "Z", $text);

    // Punctuation
    $text = preg_replace("/[‚‚]/u", ",", $text);
    $text = preg_replace("/[`‛′’‘]/u", "'", $text);
    $text = preg_replace("/[″“”«»„]/u", '"', $text);
    $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
    $text = preg_replace("/[  ]/u", ' ', $text);

    $text = str_replace("…", "...", $text);
    $text = str_replace("≠", "!=", $text);
    $text = str_replace("≤", "<=", $text);
    $text = str_replace("≥", ">=", $text);
    $text = preg_replace("/[‗≈≡]/u", "=", $text);


    // Exciting combinations
    $text = preg_replace("/[ыЫ]/u", "bl", $text);
    $text = str_replace("℅", "c/o", $text);
    $text = str_replace("₧", "Pts", $text);
    $text = str_replace("™", "tm", $text);
    $text = str_replace("№", "No", $text);
    $text = str_replace("Ч", "4", $text);
    $text = str_replace("‰", "%", $text);
    $text = preg_replace("/[∙•]/u", "*", $text);
    $text = str_replace("‹", "<", $text);
    $text = str_replace("›", ">", $text);
    $text = str_replace("‼", "!!", $text);
    $text = str_replace("⁄", "/", $text);
    $text = str_replace("∕", "/", $text);
    $text = str_replace("⅞", "7/8", $text);
    $text = str_replace("⅝", "5/8", $text);
    $text = str_replace("⅜", "3/8", $text);
    $text = str_replace("⅛", "1/8", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[Љљ]/u", "Ab", $text);
    $text = preg_replace("/[Юю]/u", "IO", $text);
    $text = str_replace("ﬁ", "fi", $text);
    $text = str_replace("ﬂ", "fl", $text);
    $text = preg_replace("/[зЗ]/u", "3", $text);
    $text = str_replace("£", "(pounds)", $text);
    $text = str_replace("₤", "(lira)", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
    $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);


    //2) Translation CP1252.
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans['f'] = '&fnof;';    // Latin Small Letter F With Hook
    $trans['-'] = array(
        '&hellip;',     // Horizontal Ellipsis
        '&tilde;',      // Small Tilde
        '&ndash;'       // Dash
        );
    $trans["+"] = '&dagger;';    // Dagger
    $trans['#'] = '&Dagger;';    // Double Dagger
    $trans['M'] = '&permil;';    // Per Mille Sign
    $trans['S'] = '&Scaron;';    // Latin Capital Letter S With Caron
    $trans['OE'] = '&OElig;';    // Latin Capital Ligature OE
    $trans["'"] = array(
        '&lsquo;',  // Left Single Quotation Mark
        '&rsquo;',  // Right Single Quotation Mark
        '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
        '&sbquo;',  // Single Low-9 Quotation Mark
        '&circ;',   // Modifier Letter Circumflex Accent
        '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
        );

    $trans['"'] = array(
        '&ldquo;',  // Left Double Quotation Mark
        '&rdquo;',  // Right Double Quotation Mark
        '&bdquo;',  // Double Low-9 Quotation Mark
        );

    $trans['*'] = '&bull;';    // Bullet
    $trans['n'] = '&ndash;';    // En Dash
    $trans['m'] = '&mdash;';    // Em Dash
    $trans['tm'] = '&trade;';    // Trade Mark Sign
    $trans['s'] = '&scaron;';    // Latin Small Letter S With Caron
    $trans['oe'] = '&oelig;';    // Latin Small Ligature OE
    $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    $trans['euro'] = '&euro;';    // euro currency symbol
    ksort($trans);

    foreach ($trans as $k => $v) {
        $text = str_replace($v, $k, $text);
    }

    // 3) remove <p>, <br/> ...
    $text = strip_tags($text);

    // 4) Decode remaining entities: &amp; => &, &quot; => "
    $text = html_entity_decode($text);


    // transliterate
    // if (function_exists('iconv')) {
    // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    // }

    // remove non ascii characters
    // $text =  preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);

    return $text;
}

?>
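
The same translate-then-strip idea can be sketched more compactly with a single lookup table passed to strtr(), which replaces whole multibyte sequences in one pass. The table below is a tiny illustrative subset and the function name is hypothetical; a real mapping would need to cover whatever characters the input actually contains. The sketch assumes UTF-8 input and a UTF-8 source file.

```php
<?php
// Hypothetical helper: transliterate a few known characters, then drop the rest.
function to_ascii_with_map(string $text): string
{
    // Illustrative subset only; extend it for the characters you expect.
    $map = [
        'č' => 'c', 'š' => 's', 'ž' => 'z', 'đ' => 'd',
        'é' => 'e', 'ü' => 'u', 'ñ' => 'n',
        '“' => '"', '”' => '"', '’' => "'",
        '…' => '...', '™' => 'tm',
    ];

    // strtr() with an array tries the longest keys first and never
    // re-scans text it has already replaced.
    $text = strtr($text, $map);

    // Anything still outside printable ASCII is removed.
    return preg_replace('/[^\x20-\x7E]/', '', $text);
}

echo to_ascii_with_map('Anton Dovečer says “hi”…'), "\n";
```

A single table keeps the mapping in one place and avoids running dozens of separate regex passes over the same string.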
Peter Mortensen
Silas Palmer
  • According to http://php.net/manual/en/function.iconv.php#74101 , that should only be an issue if you do not select a proper locale (other than C or POSIX) – MauganRa Dec 16 '14 at 09:39
  • 1
    there are only 128 characters in the ascii character set. – Jasen Sep 08 '19 at 22:21
  • Re *"the first 128 characters of the ASCII character set"*: [ASCII](https://en.wikipedia.org/wiki/ASCII) only has 128: *"ASCII has just 128 code points"*. The last bit is used for extensions, like code page [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) or [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1). – Peter Mortensen May 03 '23 at 14:51
2

I also think that the best solution might be to use a regular expression.

Here's my suggestion:

function convert_to_normal_text($text) {

    $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+\-={}|:;<>?,.\/\"\'\\\[\]";
    $normal_text = preg_replace("/[^$normal_characters]/", '', $text);

    return $normal_text;
}

Then you can use it like this:

$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
$after = convert_to_normal_text($before);
echo $after;

Displays:

Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .
simhumileco
1

I just had to add the header

header('Content-Type: text/html; charset=UTF-8');
nhahtdh
ALHaines
  • 2
    that will fix the case where UTF8 is being interpreted as WIN-1252 which is the default encoding for HTML, however it will not remove any characters from a string. – Jasen Aug 12 '14 at 00:10
  • They probably don't have control over the website: *"I'm getting strange characters when pulling data from a website:"* – Peter Mortensen May 03 '23 at 14:47
0

This should be pretty straightforward and there isn't any need for an iconv function:

// Remove all characters that are not the separator, a-z, 0-9, underscore, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-9_\s]+!', '', strtolower($string));

// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);
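
Wrapped in a function, the two steps might look like the sketch below. The function name, the separator parameter, the final trim() and the sample input are illustrative additions. The character class is written as a-z0-9_ (letters, digits, underscore), which is an assumption about what the answer intends.

```php
<?php
// Hypothetical wrapper around the two-step approach.
function slugify(string $string, string $separator = '-'): string
{
    // Escape the separator for use inside a pattern delimited by "!".
    $sep = preg_quote($separator, '!');

    // Step 1: lowercase, then remove everything that is not the separator,
    // a-z, 0-9, underscore, or whitespace.
    $string = preg_replace('![^' . $sep . 'a-z0-9_\s]+!', '', strtolower($string));

    // Step 2: collapse runs of separators and whitespace into one separator.
    $string = preg_replace('![' . $sep . '\s]+!', $separator, $string);

    // Drop separators left over at either end.
    return trim($string, $separator);
}

echo slugify('Hello,  World! -- PHP 8'), "\n";
```

Note that accented letters are removed rather than transliterated, so this only suits input that is already mostly ASCII.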
Peter Mortensen
Goran Jakovljevic
0

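This solved my problem. The pattern strips ASCII control characters (0x00-0x1F and 0x7F) and leaves everything else, including accented letters, untouched: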

$text = 'Châu Thái  Nhân 12/09/2022';
echo preg_replace('/[\x00-\x1F\x7F]/', '', $text);
//Châu Thái  Nhân 12/09/2022
  • What is the result? What does it do? Completely wipes out the characters? Removes non-printable characters? Please explain your solution. From [the Help Center](https://stackoverflow.com/help/promotion): *"...always explain why the solution you're presenting is appropriate and how it works"*. Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/73686605/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen May 03 '23 at 15:06
-1

I think the best way to do something like this is to use the ord() function. This way you can keep characters written in any language, as long as the text uses a single-byte encoding. Just remember to check the ord() values of your own text first. This will not work on multibyte Unicode encodings such as UTF-8.

$name = "βγδεζηΘKgfgebhjrf!@#$%^&";
// This function removes all non-Greek and non-English characters in the Greek ISO charset
function replace_characters($string) {
    $new_string = ''; // start empty so the result is always defined
    $str_length = strlen($string);
    for ($x=0; $x < $str_length; $x++)
    {
        $character = $string[$x];
        if ((ord($character)  >  64 && ord($character) <   91) ||
            (ord($character)  >  96 && ord($character) <  123) ||
            (ord($character)  > 192 && ord($character) <  210) ||
            (ord($character)  > 210 && ord($character) <  218) ||
            (ord($character)  > 219 && ord($character) <  250) ||
             ord($character) == 252 || ord($character) == 254)
        {
            $new_string = $new_string.$character;
        }
    }
    return $new_string;
}
// End function

$name = replace_characters($name);

echo $name;
Peter Mortensen
  • 1
    Heavy-handed but tweakable... I like it. – Kristen Waite Oct 07 '15 at 13:23
  • 2
    You're doing ord() on the same character over and over again just for different comparisons (line 9). That's extremely inefficient. You should save result of ord() in variable and then reuse it in conditional. Also, consider using === instead of == as use of == is discouraged. Although I don't blame you for this, ironically PHP manual for ord() shows using == in examples. – xZero Nov 25 '17 at 13:50
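
Following the suggestion in the comment above, a variant that calls ord() once per byte and starts from an empty result string might look like this. It is a sketch under the same assumption as the answer (a single-byte Greek encoding such as ISO 8859-7); the function name is hypothetical and the numeric ranges are copied from the answer unchanged.

```php
<?php
// Hypothetical variant: keep only English and Greek letters in a
// single-byte (ISO 8859-7 style) string.
function keep_greek_and_english(string $string): string
{
    $result = '';
    $length = strlen($string);

    for ($i = 0; $i < $length; $i++) {
        $code = ord($string[$i]);   // computed once per byte

        $keep = ($code > 64  && $code < 91)    // A-Z
             || ($code > 96  && $code < 123)   // a-z
             || ($code > 192 && $code < 210)   // ranges taken from the answer
             || ($code > 210 && $code < 218)
             || ($code > 219 && $code < 250)
             || $code === 252
             || $code === 254;

        if ($keep) {
            $result .= $string[$i];
        }
    }

    return $result;
}

// "\xE1\xE2" is "αβ" in ISO 8859-7; the digit and the punctuation are dropped.
echo bin2hex(keep_greek_and_english("Kg1!\xE1\xE2")), "\n";
```

The output is printed as hexadecimal so that it reads the same regardless of the terminal's encoding.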