The input is a Wikipedia page's first paragraph. I want to remove anything between parentheses and the parentheses themselves.

However, the HTML inside the parentheses often contains one or more parentheses of its own, generally in the href="" of a link.

Take the following:

<p>
    The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>

I want the end result to be:

<p>
    The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>

But when I use the preg_replace pattern below it doesn't work, because it gets confused by the parentheses within parentheses.

public function removeParentheses( $content ) {

    $pattern = '@\(.*?\)@';
    $content = preg_replace( $pattern, '', $content );
    $content = str_replace( ' .', '.', $content );
    $content = str_replace( '  ', ' ', $content );
    return $content;
}

Secondly, how can I keep the parentheses inside links' href="" and title="" attributes? When the link itself is not inside parenthesised text, these are important.

beaver
Lazhar
  • Regular expressions can't handle recursion. If you have recursive patterns (parens inside parens...) you need more logic, i.e. write a parser – Philipp Oct 18 '17 at 16:05
  • Do not parse HTML with regex. As @Philipp mentioned, it cannot do this effectively (sure, you can hack together a version that works, but I guarantee you it can be broken by some obscure thing in HTML). Use an XML parser like [SimpleXML](http://php.net/manual/en/simplexml.examples.php) – ctwheels Oct 18 '17 at 16:07
  • You may want to reference https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php for a list of tools to use when parsing HTML with PHP – Jeff Oct 18 '17 at 16:54

1 Answer

You can replace all the links with placeholders, then remove all the parentheses, and finally swap the placeholders back for the original links.

This is accomplished with preg_replace_callback(), passing an occurrence counter and a replacements array by reference to keep track of the links, then using your own removeParentheses() to get rid of the parentheses, and finally using str_replace() with array_keys() and array_values() to get your links back:

<?php
$string = '<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>';
$occurrences = 0;
$replacements = [];
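$replacedString = preg_replace_callback("/<a .*?>.*?<\/a>/i", function($el) use (&$occurrences, &$replacements) {
    // The ||| are just to avoid unwanted matches; closing the placeholder as well
    // keeps "|||1|||" from matching inside "|||10|||" when there are more than ten links
    $placeholder = "|||".$occurrences++."|||";
    $replacements[$placeholder] = $el[0];
    return $placeholder;
}, $string);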
function removeParentheses( $content ) {
    $pattern = '@\(.*?\)@';
    $content = preg_replace( $pattern, '', $content );
    $content = str_replace( ' .', '.', $content );
    $content = str_replace( '  ', ' ', $content );
    return $content;
}
$replacedString = removeParentheses($replacedString);
$replacedString = str_replace(array_keys($replacements), array_values($replacements), $replacedString); // get your links back
echo $replacedString;

Result

<p>
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>.
</p>

This is, however, a bit brittle in my opinion. As others told you in the comments, you shouldn't parse HTML with regular expressions: the markup can change and you can get unexpected results. This might get you going in the right direction, though.
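
For comparison, here is a minimal sketch of the parser-based route suggested in the comments, using PHP's DOMDocument. It is not part of the original answer. The function names (stripParenthesisedText, walk, removeParenthesesDom) are made up for illustration, the XML declaration prefix is a commonly used workaround to make libxml read the input as UTF-8, and the rule "an element that opens inside parentheses is dropped whole" is a simplifying assumption. Because only text nodes are inspected, parentheses inside href="" and title="" are never touched.

```php
<?php
// Sketch only: strips parenthesised text by walking DOM text nodes
// instead of running a regex over the raw markup.

/** Return $text without its parenthesised parts; $depth carries over between calls. */
function stripParenthesisedText(string $text, int &$depth): string {
    $kept = '';
    // Split into single characters in a UTF-8-safe way.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if ($char === '(') {
            $depth++;
        } elseif ($char === ')' && $depth > 0) {
            $depth--;
        } elseif ($depth === 0) {
            $kept .= $char;
        }
    }
    return $kept;
}

/** Walk the children of $node in document order, tracking the parenthesis depth. */
function walk(DOMNode $node, int &$depth): void {
    // Copy the child list first, because nodes are removed while iterating.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMText) {
            $child->data = stripParenthesisedText($child->data, $depth);
        } elseif ($child instanceof DOMElement) {
            if ($depth > 0) {
                // Assumption: an element that opens inside parentheses is dropped whole;
                // its text is still scanned so the depth stays correct.
                stripParenthesisedText($child->textContent, $depth);
                $node->removeChild($child);
            } else {
                walk($child, $depth);
            }
        }
    }
}

function removeParenthesesDom(string $html): string {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about fragment markup
    // The wrapper <div> gives the fragment a single, known root element.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $root = $doc->getElementsByTagName('div')->item(0);
    $depth = 0;
    walk($root, $depth);

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    // Same whitespace clean-up as removeParentheses() above.
    $out = str_replace(' .', '.', $out);
    return str_replace('  ', ' ', $out);
}
```

Calling removeParenthesesDom($string) on the paragraph above should drop the parenthesised passages, including the "class" link that sits inside one, while leaving links outside parentheses intact even when their href contains parentheses. How non-ASCII characters are serialised can depend on the libxml version, so check the output on your own data.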

Edit: regarding the parentheses within parentheses, you can use a recursive pattern. Take a look at this great answer by Bart Kiers:

function removeParentheses( $content ) {
    $pattern = '@\(([^()]|(?R))*\)@';
    $content = preg_replace( $pattern, '', $content );
    $content = str_replace( ' .', '.', $content );
    $content = str_replace( '  ', ' ', $content );
    return $content;
}
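
To make the difference concrete, the short comparison below runs the original lazy pattern and the recursive pattern above on a plain-text string with nested parentheses. The helper names stripLazy and stripRecursive and the sample sentence are only illustrative.

```php
<?php
// Lazy pattern from the question: stops at the first closing parenthesis.
function stripLazy(string $s): string {
    return preg_replace('@\(.*?\)@', '', $s);
}

// Recursive pattern: (?R) re-enters the whole pattern for each nested pair.
function stripRecursive(string $s): string {
    return preg_replace('@\(([^()]|(?R))*\)@', '', $s);
}

$text = 'a clade (traditionally a class (biology) or subclass) of the bony fish';

echo stripLazy($text), "\n";      // a clade  or subclass) of the bony fish
echo stripRecursive($text), "\n"; // a clade  of the bony fish
```

The leftover double space is what the str_replace( '  ', ' ', ... ) call in removeParentheses() cleans up. On raw HTML the recursive pattern would still remove parentheses that appear inside a link's href, so it is meant to be used together with the placeholder step.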

ishegg
  • This does not handle the problem of brackets within brackets as the user requested. Just the problem of the brackets in links. https://3v4l.org/VDebj – Jeff Oct 18 '17 at 16:35
  • @Jeff Thanks. It does now. – ishegg Oct 18 '17 at 16:42