It's a bad idea to try to parse html content with string functions, including regex functions (there're many topics that explain that on SO, search them). html is too complicated to do that.
The problem is that you have poorly formatted html on which you have no control.
There're two possible attitudes:
- There's nothing to do: the data are corrupted, so informations are loss once and for all and you can't retrieve something that has disappear, that's all. This is a perfectly acceptable point of view.
May be you can find another source for the same data somewhere or you can choose to print the poorly formatted html as it.
- You can try to repair. In this case you have to ensure that all the document problems are limited and can be solved (at least by hand).
In place of a direct string approach, you can use the PHP libxml implementation via DOMDocument
. Even if the libxml parser will not give better results than strip_tags
, it provides errors you can use to identify the kind of error and to find the problematic positions in the html string.
With your string, the libxml parser returns a recoverable error XML_ERR_NAME_REQUIRED
with the code 68 on each problematic opening angle bracket. Errors can be seen using libxml_get_errors()
.
Example with your string:
$s = '<p>I am <30 years old and weight <12st</p>';
$libxmlErrorState = libxml_use_internal_errors(true);
function getLastErrorPos($code) {
$errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
return $e->code === $code;
});
if ( !$errors )
return false;
$lastError = array_pop($errors);
return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}
define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
libxml_clear_errors();
$pattern = vsprintf($patternTemplate, $position);
$s = preg_replace($pattern, '<', $s, 1);
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}
echo $dom->saveHTML();
libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);
demo
$patternTemplate
is a formatted string (see sprintf
in the php manual) in which the placeholders %d
stand for respectively the number of lines before and the position from the start of the line. (0 and 8 here)
Pattern details: The goal of the pattern is to reach the angle bracket position from the start of the string.
~ # my favorite pattern delimiter
(?:
.* # all character until the end of the line
\R # the newline sequence
){0} # reach the desired line
.{8} # reach the desired column
\K # remove all on the left from the match result
< # the match result is only this character
~A # anchor the pattern at the start of the string
An other related question in which I used a similar technique: parse invalid XML manually
into <p> so wouldn't be stripped out with strip_tags
– Maartyn Jun 24 '19 at 11:35