I have a use case where I have a large amount of text (an article body), and I need to identify and remove two paragraph elements that contain specific text. It's content that we want displayed on a web page, but not in an RSS feed that is used to provide content to another tool. The elements look like this:
<p style="text-align: center;"><strong><em><<< Please consider helping us financially with your tax-deductible contribution today >>></em></strong></p>
and
<p style="text-align: center;"><a href="https://www.example.com/join-the-movement?utm_source=website&utm_campaign=my_campaign&utm_medium=article&utm_term=2016&utm_content=my_utm_content
"><img alt="" class="image-blog_body-100" src="http://www.example.com/s3/files/styles/blog_body-100/s3/images/donatenowbuttonnb.jpg?itok=3h8SQb9v" style="width: 250px; height: 75px;" /></a></p>
I can't target the p tag by a specific attribute for either one, so it seems that the best way is to identify the unique content contained inside the block and then work my way back out.
So this works as a starting point to get the text between the arrows:
<<<\s[a-zA-z\s-]+\s>>>
but I'm having trouble matching the tags before that. I obviously need to match three sets of an opening bracket, a tag name, and a closing bracket. After that, I can use a backreference to match the closing tags. I tried this:
^[<(p|em|strong)>]{1,3}<<\s[a-zA-z\s-]+\s>>>
but it's not working. What do I need to change to get those repeating tags (and the attribute text in the p tag)?
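For context on why that attempt fails: square brackets define a character class, so `[<(p|em|strong)>]{1,3}` matches one to three single characters from that set, not three whole tags. Below is a minimal sketch of one regex-only alternative, assuming the markup is always nested exactly as `p` > `strong` > `em`, that only the `p` tag carries attributes, and that the arrows appear as literal `<` and `>` characters rather than entities. The sample `$body` and the `$pattern` and `$cleaned` names are made up for illustration.

```php
<?php
// Sample input modeled on the first paragraph from the question.
$body = '<p>Intro text.</p>'
    . '<p style="text-align: center;"><strong><em><<< Please consider helping us financially '
    . 'with your tax-deductible contribution today >>></em></strong></p>'
    . '<p>More text.</p>';

// Spell out each wrapper tag explicitly; [^>]* absorbs the style attribute on <p>.
// "~" is used as the delimiter so the slashes in the closing tags need no escaping.
$pattern = '~<p[^>]*>\s*<strong>\s*<em>\s*<<<\s[a-zA-Z\s-]+\s>>>\s*</em>\s*</strong>\s*</p>~';

// Remove the whole paragraph, wrapper tags included.
$cleaned = preg_replace($pattern, '', $body);

echo $cleaned, PHP_EOL; // prints "<p>Intro text.</p><p>More text.</p>"
```

This breaks as soon as the editor adds an attribute to `strong` or `em` or reorders the tags, which is the usual argument for a DOM parser over a regex here.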
Thanks.
UPDATE: Following the suggestion by @b.enoit.be, I'm using PHP DOMDocument. I was able to modify the code that inserts the text I need to remove, and I was able to add an id value to the parent element, so that I could easily identify and remove it, e.g.:
<p id="donateButtonHeading" style="text-align: center;"><strong><em><<< Please consider helping us financially with your tax-deductible contribution today >>></em></strong></p>
getElementById works great to get a DOMElement object, but it looks like it gives me everything in parts, and what I need is to get the entire string so I can remove it, or just remove that whole element from the document. Here's what I'm trying ($body is the HTML string):
$xmlDoc = new DOMDocument();
$xmlDoc->validateOnParse = true;
$xmlDoc->loadHTML($body);
foreach (array('donateButtonHeading', 'donateButtonMarkup') as $buttonElementId) {
    $buttonElement = $xmlDoc->getElementById($buttonElementId);
}
What I'm having trouble figuring out is where to go from here. At this point $buttonElement is a DOMElement, but I need to remove it from $xmlDoc and then call $xmlDoc->saveHTML() to get my HTML output. How do I get from having my DOMElement to removing it from $xmlDoc?
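One possible direction, as a sketch rather than a confirmed answer: a DOM node can be detached by calling `removeChild()` on its parent, which is reachable through the node's `parentNode` property. The sample `$body` below is a shortened stand-in for the real article HTML, and the `libxml_use_internal_errors()` call is an assumption added to keep parser warnings about imperfect markup quiet.

```php
<?php
// Shortened stand-in for the real article body (hypothetical sample data).
$body = '<p>Keep me.</p>'
    . '<p id="donateButtonHeading" style="text-align: center;"><strong><em>Please consider helping</em></strong></p>'
    . '<p id="donateButtonMarkup"><a href="https://www.example.com/">Donate</a></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$xmlDoc = new DOMDocument();
$xmlDoc->validateOnParse = true;
$xmlDoc->loadHTML($body);
libxml_clear_errors();

foreach (array('donateButtonHeading', 'donateButtonMarkup') as $buttonElementId) {
    $buttonElement = $xmlDoc->getElementById($buttonElementId);
    // getElementById() returns null when the id is absent, so guard before removing.
    if ($buttonElement !== null && $buttonElement->parentNode !== null) {
        // Detach the element (and everything inside it) from the document.
        $buttonElement->parentNode->removeChild($buttonElement);
    }
}

$output = $xmlDoc->saveHTML();
echo $output;
```

Note that `saveHTML()` called with no argument serializes the whole document, so the result includes the doctype and the `html` and `body` wrappers that `loadHTML()` added around the fragment.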