1

My sentences include ASCII character codes like

&#x0022;&#x0023;&#x0024;&#x0025;

How can I remove all ASCII codes?

I tried strip_tags(), html_entity_decode(), and htmlspecialchars(), and they did not work.

Peter Mortensen
user198989
  • If you remove these characters, aren't you going to lose the meaning of your sentence? – Jocelyn Aug 24 '12 at 13:53
  • No, my sentences include Japanese characters and normal characters. I need to remove the Japanese characters. – user198989 Aug 24 '12 at 13:57
  • But those aren't Japanese characters, they are `"#$%` – Mihai Iorga Aug 24 '12 at 13:58
  • This might help a little: http://stackoverflow.com/questions/1497885/remove-control-characters-from-php-string though it won't handle the HTML entity form, so you might be looking for a different preg_replace. – Sammaye Aug 24 '12 at 14:01
  • For things to work here, you have to state the exact problem. You can't say you have a problem with your dog when it is in fact your cat that went blind. – Mihai Iorga Aug 24 '12 at 14:01

4 Answers

2

If you don't want the decoded characters in the result at all, you could run this:

// Hex digits include a-f, and preg_replace() returns the new string, so assign it
$text = preg_replace('/&#x[0-9a-fA-F]+;/', '', $text);

But be warned: this is basically a nuker, and given the way HTML entities work, I am sure it will interfere with other parts of your string. Personally, I would recommend leaving them in and decoding them as @hakre shows.
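A narrower variant of the idea above, as a minimal sketch: it removes only numeric entities whose code point lies outside ASCII and leaves the rest untouched. The function name `drop_non_ascii_entities` is hypothetical, and the sketch assumes the input contains only numeric entities (named ones such as `&eacute;` are not handled).

```php
<?php
// Hypothetical helper: remove numeric entities above U+007F, keep the others as-is.
function drop_non_ascii_entities(string $text): string
{
    $result = preg_replace_callback(
        '/&#(?:x([0-9a-fA-F]+)|([0-9]+));/',
        function (array $m): string {
            // Group 1 holds hex digits, group 2 decimal digits; only one is non-empty.
            $codePoint = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            return $codePoint > 0x7F ? '' : $m[0];
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo drop_non_ascii_entities('a&#x0022;b&#x65E5;c&#26412;d');
// prints: a&#x0022;bcd
```

The ASCII entity `&#x0022;` survives, while the two entities for Japanese characters are dropped.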

Sammaye
2

Are you trying to remove entities that resolve to non-ASCII characters? If that is what you want, you can use this code:

$str = '&#x0022; &#x0023; &#x0024; &#x0025; &#x7414;'; // " # $ % 琔
// decode entities
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// remove non-ascii characters
$str = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $str);

Or

// decode only iso-8859-1 entities
$str = html_entity_decode($str, ENT_QUOTES, 'iso-8859-1');
// remove any entities that remain
$str = preg_replace('/&#(x[0-9a-fA-F]+|\d+);/', '', $str);

If that's not what you want, you need to clarify the question.
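The first variant above can be wrapped into a small function, shown here as a minimal sketch. The name `strip_non_ascii` and the sample input are assumptions made for illustration, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: decode entities to UTF-8, then drop every non-ASCII character.
function strip_non_ascii(string $text): string
{
    $decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // With the /u modifier, \x00-\x7F is matched as code points, not bytes.
    $result = preg_replace('/[^\x00-\x7F]/u', '', $decoded);

    // preg_replace() returns null on a PCRE error (e.g. invalid UTF-8).
    return $result ?? $decoded;
}

echo strip_non_ascii('&#x0022;abc&#x0022; &#x65E5;&#x672C;&#x8A9E; xyz');
// prints: "abc"  xyz
```

The two spaces in the output are the ones that surrounded the removed Japanese word; collapse them separately if that matters.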

mcrumley
1

If you have the multibyte string extension at hand, this works:

$string = '&#x0022;&#x0023;&#x0024;&#x0025;';
echo mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');

Which does give:

"#$%

Loosely related is:


With the DOM extension, you could load it and convert it to a string, which probably has the benefit of dealing better with HTML elements and such:

$doc = new DOMDocument(); @$doc->loadHTML('&#x0022;&#x0023;&#x0024;&#x0025;'); // loadHTML() is an instance method
echo simplexml_import_dom($doc)->xpath('//body/p')[0];

Which does output:

"#$%

If it contains HTML, you might need to export the inner html of that element which is explained in some other answer:

Community
hakre
-1

To remove Japanese characters from a string, you may use the following code:

// Decode the text to get correct UTF-8 text:
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

// Use the UTF-8 properties with `preg_replace` to remove all Japanese characters
$text = preg_replace('/\p{Katakana}|\p{Hiragana}|\p{Han}/u', '', $text);

Documentation:

Unicode character properties
Unicode scripts

Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.

Try the code here
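As a minimal sketch of the same approach on a mixed string: the function name `remove_japanese` and the sample input are assumptions for illustration. Note that some characters common in Japanese text, such as the prolonged sound mark ー, belong to the Common script and are not matched by these three properties.

```php
<?php
// Hypothetical helper: decode entities, then remove Hiragana, Katakana and Han characters.
function remove_japanese(string $text): string
{
    $decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // One character class covering the three Unicode scripts.
    $result = preg_replace('/[\p{Hiragana}\p{Katakana}\p{Han}]+/u', '', $decoded);

    // preg_replace() returns null on a PCRE error (e.g. invalid UTF-8).
    return $result ?? $decoded;
}

echo remove_japanese('Hello &#x3053;&#x3093;&#x306B;&#x3061;&#x306F; world &#x6771;&#x4EAC; 123');
// prints: Hello  world  123
```

Because `\p{Han}` matches all Han ideographs, this also removes Chinese text, which may or may not be what you want.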

Peter Mortensen
Jocelyn