How to remove html special chars?

Question

I am creating an RSS feed file for my application in which I want to remove HTML tags, which is done by strip_tags. But strip_tags does not remove HTML special code chars:

&nbsp; &amp; &copy;

etc.

Please tell me any function which I can use to remove these special code chars from my string.

schnaader · Accepted Answer · 2009-12-14T20:43:59.653

123

Either decode them using html_entity_decode or remove them using preg_replace:

$Content = preg_replace("/&#?[a-z0-9]+;/i","",$Content);

(From here)

EDIT: Alternative according to Jacco's comment

might be nice to replace the '+' with {2,8} or something. This will limit the chance of replacing entire sentences when an unencoded '&' is present.

$Content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$Content);
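To see what the {2,8} bound changes in practice, here is a small sketch. The sample string is made up for illustration: "&Ddepartment;" stands in for an unencoded '&' that happens to be followed by a ';' later in the text.

```php
<?php
// Hypothetical sample: two real entities plus a bare '&' followed by a later ';'.
$content = "Fish &amp; chips &copy; 2009, see R&Ddepartment; for details";

// Unbounded pattern: anything between '&' and ';' made of letters/digits is removed.
$greedy = preg_replace("/&#?[a-z0-9]+;/i", "", $content);

// Bounded pattern: only 2 to 8 characters are allowed between '&' and ';'.
$bounded = preg_replace("/&#?[a-z0-9]{2,8};/i", "", $content);

echo $greedy, "\n";  // Fish  chips  2009, see R for details
echo $bounded, "\n"; // Fish  chips  2009, see R&Ddepartment; for details
```

Both variants delete the entity outright, so "bel&nbsp;giorno" would come out as "belgiorno". If word spacing matters, replacing &nbsp; with a plain space before running the pattern is one way around that.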

edited Dec 14 '09 at 20:43

answered Mar 18 '09 at 10:16

schnaader


6

might be nice to replace the '+' with '{2,8] or something. This will limit the chance of replacing entire sentences when an unencoded '&' is present. – Jacco Mar 18 '09 at 12:36
Thanks, added your comment and an alternative version to the answer. – schnaader Mar 18 '09 at 12:42
I made a typo: it should be {2,8} sorry about that – Jacco Mar 18 '09 at 14:24
1

but why would one want to remove those characters? – andi Mar 18 '09 at 22:19
4

Those character-entities are not valid in RSS/Atom/XML. so you can do 2 thing: remove them, or replace them with their number-equivalent. – Jacco Mar 18 '09 at 23:36
1

A possible case for having to remove them is when stripping HTML for sending it as an alternate, plain-text body along in an email. – SquareCat Oct 24 '12 at 23:58
What about … & + hellip is that caught by the above regular expression? – Muskie Sep 09 '13 at 01:39
This is really clean and useful suggestions to remove special characters for title tag values which shows ? marks or decoded entities in the browser tab, which is not good. – raphie Feb 05 '15 at 17:19

score 21 · Answer 2 · edited Mar 18 '09 at 10:30


Use html_entity_decode to convert HTML entities.

You'll need to set charset to make it work correctly.
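A minimal sketch of what "set charset" looks like, assuming the feed is UTF-8. The sample string is invented; the flags and the encoding argument are the standard html_entity_decode parameters.

```php
<?php
// Hypothetical input containing named and numeric entities.
$raw = "caf&eacute; &amp; cr&egrave;me &copy; 2009 &#8364;5";

// ENT_QUOTES also decodes &#039; and &quot;; ENT_HTML5 enables the HTML5 entity table.
// The third argument is the target encoding of the decoded characters.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n"; // café & crème © 2009 €5
```

Note that the decoded string now contains a bare '&', which has to be escaped again before it is written into XML.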

edited Mar 18 '09 at 10:30

vartec


answered Mar 18 '09 at 10:15

andi


1

this is more correctly because when we just replace with empty string we get incorrect result - all non breakable spaces are collapsed – heximal Mar 30 '12 at 19:09
2

This! All you need is run the `html_entity_decode` on the string and then use `strip_tags` and lastly use `filter_var($string, FILTER_SANITIZE_STRING)`. – Mohd Abdul Mujib Jan 14 '19 at 13:31

score 19 · Answer 3 · edited Jun 12 '22 at 13:18


In addition to the good answers above, PHP also has a built-in filter function that is quite useful: filter_var.

To remove HTML characters, use:

$cleanString = filter_var($dirtyString, FILTER_SANITIZE_STRING);
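One caveat for newer PHP versions: FILTER_SANITIZE_STRING is deprecated as of PHP 8.1. The following is a rough filter-free sketch, not a drop-in replacement for the filter. The helper name cleanText is made up here, and treating the non-breaking space as ordinary whitespace is an assumption about what a plain-text feed wants.

```php
<?php
// Hypothetical helper (not part of PHP): strip tags, decode entities, tidy whitespace.
function cleanText(string $dirty): string {
    // Remove markup first, then turn entities into real UTF-8 characters.
    $text = html_entity_decode(strip_tags($dirty), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Treat the non-breaking space (U+00A0) like normal whitespace and collapse runs.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;
    return trim($text);
}

echo cleanText("<p>Tom&nbsp;&amp;&nbsp;Jerry  &copy;\n1940</p>"), "\n"; // Tom & Jerry © 1940
```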


edited Jun 12 '22 at 13:18

donohoe


answered Feb 16 '12 at 16:59

gpkamp


1

I know the thread is a little old, but I am looking to solve the same problem... Unfortunately filter_var requires 5.2 or newer...Otherwise this would be the answer (at least to my specific problem). Thanks. – ChronoFish May 14 '12 at 14:55

score 10 · Answer 4 · answered Mar 18 '09 at 10:16

You may want to take a look at htmlentities() and html_entity_decode():

$orig = "I'll \"walk\" the <b>dog</b> now";

$a = htmlentities($orig);

$b = html_entity_decode($a);

echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now

echo $b; // I'll "walk" the <b>dog</b> now

score 4 · Answer 5 · edited Aug 07 '13 at 08:14


This might work well to remove special characters.

$modifiedString = preg_replace("/[^a-zA-Z0-9_.\s-]/", "", $content);
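Two things are worth checking before relying on a whitelist like this. It only drops the '&' and ';' of an entity, so the entity name stays behind, and an ASCII-only class also throws away accented letters. A sketch of a Unicode-aware variant, assuming UTF-8 input (the sample string is invented):

```php
<?php
// Hypothetical input: one entity, some punctuation, and accented letters.
$content = "Fish &amp; chips: crème brûlée!";

// \p{L} = any letter, \p{N} = any number; the /u modifier makes the pattern UTF-8 aware.
// The hyphen is placed last in the class so it is read as a literal character.
$unicodeSafe = preg_replace('/[^\p{L}\p{N}_.\s-]/u', '', $content);

echo $unicodeSafe, "\n"; // Fish amp chips crème brûlée
```

The leftover "amp" shows why this approach is better suited to cleaning slugs or identifiers than to removing entities from feed text.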

edited Aug 07 '13 at 08:14

Joetjah


answered Mar 29 '13 at 09:58

Vinit Kadkol


Jay · Answer 6 · 2018-05-15T14:33:50.797

If you want to convert the HTML special characters rather than just remove them, and also strip the markup to prepare plain text, this is the solution that worked for me:

function htmlToPlainText($str){
    $str = str_replace('&nbsp;', ' ', $str);
    $str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
    $str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
    $str = html_entity_decode($str);
    $str = htmlspecialchars_decode($str);
    $str = strip_tags($str);

    return $str;
}

$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>';

echo htmlToPlainText($string);
// this is ( ) a test
// Yes this is! & does it get "processed"?

html_entity_decode w/ ENT_QUOTES | ENT_XML1 converts things like &#39;, htmlspecialchars_decode converts things like &amp;, html_entity_decode converts things like &lt;, and strip_tags removes any HTML tags left over.

EDIT - Added str_replace('&nbsp;', ' ', $str); and several other html_entity_decode() calls, as continued testing has shown a need for them.

Also add str_replace("&nbsp;", " ", $str); so that it doesn't get converted into some other special char, as was happening in my case. – Vikas Chauhan May 14 '18 at 10:47

score 2 · Answer 7 · answered Dec 16 '13 at 15:36


What I did was to use html_entity_decode first, then strip_tags to remove the tags.
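The order of the two calls matters when the text contains escaped markup. A small sketch with an invented input string shows the difference:

```php
<?php
// Hypothetical input: real tags (<p>, <i>) plus escaped markup (&lt;b&gt;...).
$input = '<p>Use &lt;b&gt;bold&lt;/b&gt; &amp; <i>italic</i></p>';

// Decode first, then strip: the escaped <b> becomes a real tag and is removed as well.
$decodeThenStrip = strip_tags(html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8'));

// Strip first, then decode: the escaped <b> survives as literal text.
$stripThenDecode = html_entity_decode(strip_tags($input), ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decodeThenStrip, "\n"; // Use bold & italic
echo $stripThenDecode, "\n"; // Use <b>bold</b> & italic
```

Which result is right depends on whether escaped markup in the source should be treated as text the author wanted to show or as markup to discard.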

answered Dec 16 '13 at 15:36

Gwapz Juan


score 2 · Answer 8 · answered Mar 11 '14 at 04:11


try this

<?php
$str = "\x8F!!!";

// Outputs an empty string
echo htmlentities($str, ENT_QUOTES, "UTF-8");

// Outputs "!!!"
echo htmlentities($str, ENT_QUOTES | ENT_IGNORE, "UTF-8");
?>
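Some context for the snippet above: "\x8F" is not a valid UTF-8 sequence, which is why the first call returns an empty string, and ENT_IGNORE silently drops the bad byte. The PHP manual discourages ENT_IGNORE because of possible security implications. A sketch of the alternative, ENT_SUBSTITUTE, which keeps a visible marker (U+FFFD) in place of the invalid byte:

```php
<?php
// Same invalid UTF-8 input as in the answer.
$str = "\x8F!!!";

// ENT_SUBSTITUTE replaces the invalid sequence with U+FFFD instead of dropping it
// or returning an empty string.
$safe = htmlspecialchars($str, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $safe, "\n";
```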

answered Mar 11 '14 at 04:11

RaGu


4

make some notes "why your code works"? So it would be clear to others. – Praveen Mar 11 '14 at 04:28

score 2 · Answer 9 · answered Mar 18 '09 at 11:19

A plain vanilla strings way to do it without engaging the preg regex engine:

function remEntities($str) {
  if(substr_count($str, '&') && substr_count($str, ';')) {
    // Find amper
    $amp_pos = strpos($str, '&');
    //Find the ;
    $semi_pos = strpos($str, ';');
    // Only if the ; is after the &
    if($semi_pos > $amp_pos) {
      //is a HTML entity, try to remove
      $tmp = substr($str, 0, $amp_pos);
      $tmp = $tmp. substr($str, $semi_pos + 1, strlen($str));
      $str = $tmp;
      //Has another entity in it?
      if(substr_count($str, '&') && substr_count($str, ';'))
        $str = remEntities($tmp);
    }
  }
  return $str;
}

score 1 · Answer 10 · edited Jul 10 '10 at 17:33

<?php
function strip_only($str, $tags, $stripContent = false) {
    $content = '';
    if(!is_array($tags)) {
        $tags = (strpos($tags, '>') !== false
                 ? explode('>', str_replace('<', '', $tags))
                 : array($tags));
        if(end($tags) == '') array_pop($tags);
    }
    foreach($tags as $tag) {
        if ($stripContent)
             $content = '(.+</'.$tag.'[^>]*>|)';
         $str = preg_replace('#</?'.$tag.'[^>]*>'.$content.'#is', '', $str);
    }
    return $str;
}

$str = '<font color="red">red</font> text';
$tags = 'font';
$a = strip_only($str, $tags); // red text
$b = strip_only($str, $tags, true); // text
?>

score 1 · Answer 11 · answered Mar 18 '09 at 16:21

It looks like what you really want is:

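function xmlEntities($string) {
    // Map each named entity to its numeric equivalent, e.g. &copy; => &#169;.
    $translationTable = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
    $from = array();
    $to = array();

    foreach ($translationTable as $char => $entity) {
        $from[] = $entity;
        // mb_ord() (PHP 7.2+, mbstring) returns the full code point; ord() reads only the first byte.
        $to[] = '&#'.mb_ord($char, 'UTF-8').';';
    }
    return str_replace($from, $to, $string);
}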

It replaces the named-entities with their number-equivalent.
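If the feed is UTF-8, another route to XML-safe text is to decode every entity to its real character and then escape only what XML requires. This is a sketch under that assumption; the helper name xmlSafe is hypothetical.

```php
<?php
// Hypothetical helper (not part of PHP): produce text that is safe inside an XML element.
function xmlSafe(string $html): string {
    // Strip markup, then decode named and numeric entities to UTF-8 characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Re-escape only the characters XML treats as special (&, <, >, quotes).
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// &nbsp; and &copy; become real characters; the ampersand is written as &amp;.
echo xmlSafe('<p>Tom &amp; Jerry&nbsp;&copy; 1940</p>'), "\n";
```

The result contains no named entities other than the ones XML itself defines, so it does not depend on an HTML DTD.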

score 1 · Answer 12 · answered Jul 14 '11 at 15:08


The function I used to perform the task, incorporating the improvement made by schnaader, is:

    mysql_real_escape_string(
        preg_replace_callback("/&#?[a-z0-9]+;/i", function($m) { 
            return mb_convert_encoding($m[0], "UTF-8", "HTML-ENTITIES"); 
        }, strip_tags($row['cuerpo'])))

This removes every HTML tag and converts every HTML entity to UTF-8, ready to save in MySQL.

answered Jul 14 '11 at 15:08

Lalala


creating an rss feed, not saving to sql – I wrestled a bear once. Dec 31 '13 at 15:35

score 0 · Answer 13 · edited Oct 01 '15 at 14:53


You can try htmlspecialchars_decode($string). It works for me.

http://www.w3schools.com/php/func_string_htmlspecialchars_decode.asp

edited Oct 01 '15 at 14:53

josliber


answered Oct 01 '15 at 12:56

surabhivin


3

Downvoted for linking to w3chools instead of the official documentation: http://php.net/htmlspecialchars_decode That said, this does not solve the OP's question. – Aaron Cicali Oct 01 '15 at 16:56

score 0 · Answer 14 · edited Sep 29 '20 at 09:07

If you are working in WordPress and are like me and simply need to check for an empty field (and there are a copious amount of random html entities in what seems like a blank string) then take a look at:

sanitize_title_with_dashes( string $title, string $raw_title = '', string $context = 'display' )

Link to wordpress function page

For people not working on WordPress, I found this function REALLY useful to create my own sanitizer, take a look at the full code and it's really in depth!

score -1 · Answer 15 · answered May 13 '15 at 11:32

$string = "äáčé";

$convert = Array(
        'ä'=>'a',
        'Ä'=>'A',
        'á'=>'a',
        'Á'=>'A',
        'à'=>'a',
        'À'=>'A',
        'ã'=>'a',
        'Ã'=>'A',
        'â'=>'a',
        'Â'=>'A',
        'č'=>'c',
        'Č'=>'C',
        'ć'=>'c',
        'Ć'=>'C',
        'ď'=>'d',
        'Ď'=>'D',
        'ě'=>'e',
        'Ě'=>'E',
        'é'=>'e',
        'É'=>'E',
        'ë'=>'e',
    );

$string = strtr($string , $convert );

echo $string; //aace

This doesn't answer the OP's issue. – FluffyKitten Feb 24 '16 at 19:24

HoldOffHunger · Answer 16 · 2021-12-13T02:13:01.243

What If By "Remove HTML Special Chars" You Meant "Replace Appropriately"?

After all, just look at your example...

&nbsp; &amp; &copy;

If you're stripping this for an RSS feed, shouldn't you want the equivalents?

" ", &, ©

Or maybe you don't exactly want the equivalents. Maybe you'd want to have &nbsp; just be ignored (to prevent too much space), but then have &copy; actually get replaced. Let's work out a solution that solves anyone's version of this problem...

How to SELECTIVELY-REPLACE HTML Special Chars

The logic is simple: preg_match_all('/(&#[0-9]+;)/', ...) grabs all of the matches, and then we simply build a list of matchables and replaceables, such as str_replace([searchlist], [replacelist], $term). Before we do this, we also need to convert named entities to their numeric counterparts, i.e., "&nbsp;" is unacceptable, but "&#160;" is fine. (Thanks to it-alien's solution to this part of the problem.)

Working Demo

In this demo, I replace &#123; with "HTML Entity #123". Of course, you can fine-tune this to any kind of find-replace you want for your case.

Why did I make this? I use it with generating Rich Text Format from UTF8-character-encoded HTML.

See full working demo:

Full Online Working Demo

    function FixUTF8($args) {
        $output = $args['input'];
        
        $output = convertNamedHTMLEntitiesToNumeric(['input'=>$output]);
        
        preg_match_all('/(&#[0-9]+;)/', $output, $matches, PREG_OFFSET_CAPTURE);
        $full_matches = $matches[0];
        
        $found = [];
        $search = [];
        $replace = [];
        
        for($i = 0; $i < count($full_matches); $i++) {
            $match = $full_matches[$i];
            $word = $match[0];
            if(empty($found[$word])) {
                $found[$word] = TRUE;
                $search[] = $word;
                $replacement = str_replace(['&#', ';'], ['HTML Entity #', ''], $word);
                $replace[] = $replacement;
            }
        }

        $new_output = str_replace($search, $replace, $output);
        
        return $new_output;
    }
    
    function convertNamedHTMLEntitiesToNumeric($args) {
        $input = $args['input'];
        return preg_replace_callback("/(&[a-zA-Z][a-zA-Z0-9]*;)/",function($m){
            $c = html_entity_decode($m[0],ENT_HTML5,"UTF-8");
            # return htmlentities($c,ENT_XML1,"UTF-8"); -- see update below
            
            $convmap = array(0x80, 0xffff, 0, 0xffff);
            return mb_encode_numericentity($c, $convmap, 'UTF-8');
        }, $input);
    }

print(FixUTF8(['input'=>"Oggi &egrave; un bel&nbsp;giorno"]));

Input:

"Oggi &egrave; un bel&nbsp;giorno"

Output:

Oggi HTML Entity #232 un belHTML Entity #160giorno
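The same selective idea can be written more compactly with a single callback. This is a sketch, not the code from the demo: the function name replaceEntities and the policy array are hypothetical, and entities that are not listed in the policy are simply decoded to UTF-8.

```php
<?php
// Hypothetical helper: $policy maps an entity (as written in the text) to its replacement.
function replaceEntities(string $text, array $policy): string {
    $result = preg_replace_callback(
        // Matches decimal (&#160;), hex (&#xA0;) and named (&nbsp;) entities.
        '/&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i',
        function (array $m) use ($policy): string {
            if (array_key_exists($m[0], $policy)) {
                return $policy[$m[0]]; // caller-defined replacement
            }
            // Default: decode to the real character (unknown names are left unchanged).
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $text
    );
    return $result ?? $text; // preg_replace_callback returns null on error
}

// Turn &nbsp; into a plain space, decode everything else.
$out = replaceEntities('Oggi &egrave; un bel&nbsp;giorno &copy; 2021', ['&nbsp;' => ' ']);
echo $out, "\n"; // Oggi è un bel giorno © 2021
```

Mapping an entity to an empty string in the policy removes it, which covers the "just ignore &nbsp;" case mentioned earlier.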
