The input is a Wikipedia page's first paragraph. I want to remove anything between parentheses and the parentheses themselves.
However, the HTML inside the parentheses often contains one or more parentheses of its own, usually in the href="" of a link.
Take the following:
<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>
I want the end-result to be:
<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>
But when I use the preg_replace pattern below, it doesn't work, because it gets confused by the parentheses within parentheses.
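public function removeParentheses( $content ) {
    // Lazy match: stops at the FIRST ")", so nested parentheses break it.
    $pattern = '@\(.*?\)@';
    $content = preg_replace( $pattern, '', $content );
    // Tidy up the whitespace left behind by the removals.
    $content = str_replace( ' .', '.', $content );
    $content = str_replace( '  ', ' ', $content );
    return $content;
}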
Secondly, how can I leave the parentheses inside links' href="" and title="" intact? These are important when they are not inside text parentheses.
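To make the two requirements concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the markup is simple enough that no attribute value contains a ">" character, and that parentheses in the text are balanced. The function name removeTextParentheses is hypothetical. The idea is to split the input into tag and text tokens, count parenthesis depth in the text tokens only, and drop everything (tags included) while the depth is above zero:

```php
<?php
// Hypothetical helper (not from the original code): removes parenthesised
// text, including nested parentheses, while leaving parentheses inside
// HTML tags (e.g. href="" and title="") untouched when the tag is kept.
function removeTextParentheses(string $html): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes are text, odd indexes are the captured tags.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $depth  = 0;  // current nesting depth of text parentheses
    $out    = '';

    foreach ($tokens as $index => $token) {
        if ($index % 2 === 1) {
            // A tag: keep it verbatim only outside text parentheses.
            if ($depth === 0) {
                $out .= $token;
            }
            continue;
        }
        // Text: scan byte by byte. "(" and ")" are single ASCII bytes,
        // so this is safe for UTF-8 input.
        $length = strlen($token);
        for ($i = 0; $i < $length; $i++) {
            $char = $token[$i];
            if ($char === '(') {
                $depth++;
            } elseif ($char === ')' && $depth > 0) {
                $depth--;
            } elseif ($depth === 0) {
                $out .= $char;
            }
        }
    }
    return $out;
}
```

The removals leave doubled spaces behind, so the str_replace cleanup from removeParentheses would still be needed afterwards. A tag that opens inside parentheses and closes outside them would be left unbalanced by this sketch; a DOM-based approach may be safer for arbitrary HTML.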