This assumes that you only need the contents of the top-level math tags and nothing else, and that you have not managed to get them so far because the math tags contain invalid XML, which makes any XML parser give up (as mentioned in the question and comments).
The clean approach would probably be to use a fault-tolerant XML parser (or a fault-tolerant mode), or to run the input through Tidy first. However, all of these approaches might "corrupt" the content.
The hacky and possibly dirty approach is the following. It may well have other issues, especially if the remaining XML is also invalid or your math tags are nested (either will make the XML parser fail in step 2):
- replace every <math>.*</math> match (ungreedy) with a placeholder via preg_replace_callback or something similar (the placeholder should be unique; uniqid might help, but a simple counter is probably enough)
- parse the document with a common xml-parser (wrapping it in some root tag as necessary)
- fetch all child nodes of the root node (or all root-level nodes) and check which ones were generated in step 1
For example, given this input:
<math>some invalid xml</math>
<sometag>
<math>more invalid xml</math>
some text
</sometag>
run the replacement like this:
$replacements = [];
$newcontent = preg_replace_callback(
    '/'.preg_quote('<math>','/').'(.*)'.preg_quote('</math>','/').'/siU',
    // capture $replacements by reference, otherwise the callback only fills a copy
    function($hit) use (&$replacements) {
        $id = uniqid();
        $replacements[$id] = $hit[1];
        return '<math id="'.$id.'" />';
    },
    $originalcontent);
which will turn your content into:
<math id="1stuniqid" />
<sometag>
<math id="2nduniqid" />
some text
</sometag>
Now use the XML parser of your choice, select all root-level elements and look for /math/@id (my XPath may well be wrong, adjust as needed, for example to /root/math/@id if you wrapped the content in a root tag). The result should contain all the uniqids, which you can look up in your $replacements array.
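The three steps could be put together roughly as follows. This is only a sketch under some assumptions: the sample input, the wrapping tag name root, the m1/m2 counter ids and the variable $topLevelMath are made up for illustration, and it uses PHP's DOMDocument and DOMXPath as "the XML parser of your choice".

```php
<?php
// Hypothetical input: the <math> contents are deliberately not well-formed XML.
$originalcontent = "<math>a < b & c</math>\n"
    . "<sometag>\n<math>x <y> z</math>\nsome text\n</sometag>";

// Step 1: swap every <math>...</math> for a self-closing placeholder.
$replacements = [];
$counter = 0;
$newcontent = preg_replace_callback(
    '/<math>(.*)<\/math>/siU',
    function ($hit) use (&$replacements, &$counter) {
        $id = 'm' . (++$counter);          // simple counter instead of uniqid()
        $replacements[$id] = $hit[1];      // remember the raw math content
        return '<math id="' . $id . '" />';
    },
    $originalcontent
);

// Step 2: wrap in an artificial root tag (name is an assumption) and parse.
$doc = new DOMDocument();
$doc->loadXML('<root>' . $newcontent . '</root>');

// Step 3: select only placeholders that are direct children of the root
// and map their ids back to the original math content.
$xpath = new DOMXPath($doc);
$topLevelMath = [];
foreach ($xpath->query('/root/math/@id') as $attr) {
    $topLevelMath[] = $replacements[$attr->value];
}

print_r($topLevelMath);
```

With this input, only the first math block ends up in $topLevelMath. The one inside sometag is skipped because it is not a direct child of the wrapping root.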
Edit: fixed some preg_quote problems and switched to more standard delimiters.