How to remove all ASCII codes from a string

Question

My sentence includes ASCII character codes like

&#x0022;&#x0023;&#x0024;&#x0025;

How can I remove all ASCII codes?

I tried strip_tags(), html_entity_decode(), and htmlspecialchars(), and they did not work.

If you remove these characters, aren't you going to lose the meaning of your sentence? — Jocelyn, Aug 24 '12 at 13:53
No, my sentences include Japanese chars and normal chars. I need to remove the Japanese chars. — user198989, Aug 24 '12 at 13:57
This might help a little: http://stackoverflow.com/questions/1497885/remove-control-characters-from-php-string though it won't match the HTML entity form, so you might be looking for a different preg_replace. — Sammaye, Aug 24 '12 at 14:01
For things to work here you have to state the exact problem. You can't say you have a problem with your dog when it is actually your cat that went blind. — Mihai Iorga, Aug 24 '12 at 14:01

score 2 · Answer 1 · answered Aug 24 '12 at 14:13

If you don't need the characters that the entities decode to, you could strip the entities with this:

// Hex digits include a-f, and the result must be assigned back
$text = preg_replace('/&#x[0-9a-f]+;/i', '', $text);

But be warned: this is basically a nuker, and given the way HTML entities work I am sure it will interfere with other parts of your string. Personally, I would recommend leaving them in and encoding them as @hakra shows.
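To make the warning concrete, here is a minimal sketch of such a "nuker". The function name `strip_numeric_entities` is hypothetical, and the pattern is widened (as an assumption about what is wanted) to cover decimal references as well as hexadecimal ones.

```php
<?php
// Hypothetical helper: deletes numeric character references without decoding them.
function strip_numeric_entities(string $text): string
{
    // Matches &#x...; (hex) and &#...; (decimal); named entities such as &amp; are left alone.
    // preg_replace() returns null on a PCRE error, so fall back to the input in that case.
    return preg_replace('/&#(?:x[0-9a-f]+|[0-9]+);/i', '', $text) ?? $text;
}

echo strip_numeric_entities('Price: &#x0024;5 &amp; up &#12354;'), "\n";
// prints "Price: 5 &amp; up "
```

Note that the dollar sign disappears along with the Japanese character, which is exactly the kind of side effect the warning is about.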

answered by Sammaye

Warning: Wrong parameter count for preg_replace() – user198989 Aug 24 '12 at 14:17

score 2 · Answer 2 · answered Aug 24 '12 at 14:34

Are you trying to remove entities that resolve to non-ASCII characters? If so, you can use this code:

$str = '&#x0022; &#x0023; &#x0024; &#x0025; &#x7414;'; // " # $ % 琔
// decode entities
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// remove non-ascii characters
$str = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $str);

Or

// decode only iso-8859-1 entities
$str = html_entity_decode($str, ENT_QUOTES, 'iso-8859-1');
// remove any numeric entities that remain (hex digits include a-f)
$str = preg_replace('/&#(x[0-9a-f]+|\d+);/i', '', $str);

If that's not what you want you need to clarify the question.
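As a rough illustration of the first variant, the two steps can be wrapped in one function. The name `decode_and_keep_ascii` and the sample input are made up for this sketch.

```php
<?php
// Hypothetical helper: decode entities to UTF-8, then drop everything outside ASCII.
function decode_and_keep_ascii(string $html): string
{
    // Step 1: turn entities such as &#x0022; or &#x7414; into real UTF-8 characters.
    $decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // Step 2: remove every code point above U+007F (the /u flag makes PCRE work on code points).
    return preg_replace('/[^\x00-\x7F]/u', '', $decoded) ?? $decoded;
}

echo decode_and_keep_ascii('&#x0022;abc&#x7414; &#x3042;&#x0025;'), "\n";
// prints "abc %   (with the leading double quote)
```

Unlike deleting every entity, this keeps the entities that decode to ASCII (`"` and `%` here) and removes only the ones that decode to non-ASCII characters.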

score 1 · Answer 3 · edited May 23 '17 at 12:21

If you have the multibyte string extension at hand, this works:

$string = '&#x0022;&#x0023;&#x0024;&#x0025;';
echo mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES'); // note: 'HTML-ENTITIES' is deprecated as of PHP 8.2

Which does give:

"#$%

Loosely related is:

PHP DomDocument failing to handle utf-8 characters (☆)

With the DOM extension you could load it and convert it to a string, which probably has the benefit of dealing better with HTML elements and such:

($doc = new DOMDocument())->loadHTML('&#x0022;&#x0023;&#x0024;&#x0025;'); // loadHTML() is not static
echo simplexml_import_dom($doc)->xpath('//body/p')[0];

Which does output:

"#$%

If it contains HTML, you might need to export the inner html of that element which is explained in some other answer:

DOMDocument : how to get inner HTML as Strings separated by line-breaks?

score -1 · Answer 4 · edited Aug 25 '12 at 07:16

To remove Japanese characters from a string, you may use the following code:

// Decode the text to get correct UTF-8 text:
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

// Use the UTF-8 properties with `preg_replace` to remove all Japanese characters
$text = preg_replace('/\p{Katakana}|\p{Hiragana}|\p{Han}/u', '', $text);

Documentation:

Unicode character properties
Unicode scripts

Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.
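A small sketch of how this plays out on mixed text follows. The function name `remove_japanese` and the sample string are hypothetical. It uses the same three script properties, combined into one character class.

```php
<?php
// Hypothetical helper: decode entities, then delete Hiragana, Katakana and Han characters.
function remove_japanese(string $html): string
{
    $text = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // Latin letters, digits and ASCII punctuation belong to other scripts and are kept.
    return preg_replace('/[\p{Hiragana}\p{Katakana}\p{Han}]+/u', '', $text) ?? $text;
}

// The five entities in the middle decode to the Hiragana word "こんにちは".
echo remove_japanese('Hello &#x3053;&#x3093;&#x306B;&#x3061;&#x306F; world &#x0025;'), "\n";
// prints "Hello  world %"
```

One caveat, as far as I know: characters that Unicode assigns to the Common script, such as the prolonged sound mark ー or the ideographic full stop 。, are not matched by these properties and would need to be added to the class separately if they should go too.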
