I have a string that contains an HTML page. I need only the content between <body> and </body>, and I want to remove all inline HTML attributes except colspan. Here is what I have so far (it still removes the colspan attributes along with everything else):
<?php
$html = 'CURL GET THE HTML (mostly just tables)';
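// Remove HTML comments, JavaScript content, CSS and unneeded HTML tags.
// Each tag pattern starts right after "<" (or "</"), so <thead>, <tbody> and <header> are left alone.
$pregReplacePattern = array(
    '/<!--(.*)-->/Uis',                  // HTML comments
    '#<!DOCTYPE[^>]*>#i',                // doctype declaration
    '#</?html\b[^>]*>#i',                // <html> and </html>
    '#</?head\b[^>]*>#i',                // <head> and </head>
    '#<title\b[^>]*>.*?</title>#is',     // <title> including its text
    '#<meta\b[^>]*>#i',
    '#<script\b[^>]*>.*?</script>#is',   // JavaScript, including multi-line blocks
    '#<style\b[^>]*>.*?</style>#is',     // CSS
    '#<link\b[^>]*>#i',
    '#</?body\b[^>]*>#i',                // <body> and </body>
    '#</?form\b[^>]*>#i',                // <form> and </form>
    '#<img\b[^>]*>#i',
);
// A single empty string is used as the replacement for every pattern
$html = preg_replace($pregReplacePattern, '', $html);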
// Remove inline HTML properties (all of them)
$html = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i", '<$1$2>', $html);
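To make the problem easy to reproduce, here is a minimal sketch that runs only the last pattern on a single table row. The `$sample` row is made up for illustration and is not taken from my real page. The comments show what I get and what I would like to get instead.

```php
<?php
// Hypothetical sample row, invented only to demonstrate the problem.
$sample = '<tr><td colspan="2" class="x" style="color:red">A</td></tr>';

// Same pattern as above: it drops every attribute, colspan included.
$stripped = preg_replace('/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i', '<$1$2>', $sample);

echo $stripped, PHP_EOL;
// Actual output:  <tr><td>A</td></tr>
// Desired output: <tr><td colspan="2">A</td></tr>
```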
How can I change the last pattern so that colspan is kept while every other attribute is removed?
Thanks in advance.