I have a string that contains an HTML page. I need only the content between <body> and </body>, and I want to remove all inline HTML attributes except colspan. Here is what I have so far (it still removes the colspan attributes as well):

<?php
$html = 'CURL GET THE HTML (mostly just tables)';

// Remove HTML comments, JavaScript content, CSS and unneeded HTML tags.
// Each tag pattern is anchored on the tag name (so <thead> is not mistaken
// for <head>), and the "s" flag lets "." match across line breaks.
$pregReplacePattern = array(
    '/<!--(.*)-->/Uis',
    '#<!DOCTYPE[^>]*>#i',
    '#</?html\b[^>]*>#i',
    '#</?head\b[^>]*>#i',
    '#<title\b[^>]*>.*?</title>#is',
    '#<meta\b[^>]*>#i',
    '#<script\b[^>]*>.*?</script>#is',
    '#<link\b[^>]*>#i',
    '#</?body\b[^>]*>#i',
    '#</?form\b[^>]*>#i',
    '#<img\b[^>]*>#i',
);
// A single replacement string is applied to every pattern in the array.
$html = preg_replace($pregReplacePattern, '', $html);

// Remove inline HTML properties (all of them, including colspan)
$html = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i", '<$1$2>', $html);

Can anyone help me?

Thanks in advance...

hassan

1 Answer

Maybe you should just use echo statements in your PHP and not bother with regex. Here is what I mean:

<?php
// A nowdoc prints the text verbatim: no variable interpolation takes place
// and the quotes inside it do not need escaping.
echo <<<'CODE'
// Remove HTML comments, JavaScript content, CSS and not needed HTML tags
$pregReplacePattern = array(
'/<!--(.*)-->/Uis',
'#<.*?!DOCTYPE.*?>#i',
'#<.*?html.*?>#i',
'#<.*?head.*?>#i',
'#<title.*?>.*?</title>#i',
'#<.*?meta.*?>#i',
'#<script.*?>.*?</script#i',
'#<.*?link.*?>#i',
'#<.*?body.*?>#i',
'#<.*?form.*?>#i',
'#<img.*?>#i',
'"/<img[^>]+\>/i"',
);
$pregReplaceTo = array_fill_keys(
range(0, count($pregReplacePattern) - 1), ''
);
$html = preg_replace($pregReplacePattern, $pregReplaceTo, $html);

// Remove inline HTML properties (all of them)
$html = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i", '<$1$2>', $html);
CODE;
?>

It works fine on IE 9.1
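
If the goal is to avoid regex altogether, another option is to let an HTML parser do the work. The following is only a sketch, assuming PHP's DOM extension (DOMDocument, DOMXPath) is available; the function name bodyWithOnlyColspan is made up for illustration and does not come from the question. It keeps the children of <body>, drops comments, scripts, styles and images, and strips every attribute except colspan. Character-encoding handling is left out.

```php
<?php
// Hypothetical helper (not from the original post): returns the inner HTML
// of <body> with every attribute except "colspan" removed.
function bodyWithOnlyColspan(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $xpath = new DOMXPath($doc);

    // Drop comments, scripts, styles and images below <body>.
    // The node list is copied to an array first so removal is safe.
    $unwanted = $xpath->query('.//comment() | .//script | .//style | .//img', $body);
    foreach (iterator_to_array($unwanted, false) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Strip every attribute except colspan from the remaining elements.
    foreach ($xpath->query('.//*[@*]', $body) as $element) {
        foreach (iterator_to_array($element->attributes, false) as $attr) {
            if (strtolower($attr->nodeName) !== 'colspan') {
                $element->removeAttribute($attr->nodeName);
            }
        }
    }

    // Serialize only the children of <body>, not the <body> tag itself.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return trim($out);
}
```

Called on the page string from the question, e.g. $html = bodyWithOnlyColspan($html);, it would take the place of both preg_replace calls. Unlike the pattern list in the question, this sketch leaves <form> tags in place.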

Juan K