0

I have a use case where I have a large amount of text (an article body), and I need to identify and remove two paragraph elements that contain specific text. It's content that we want displayed on a web page, but not in an RSS feed that is used to provide content to another tool. The elements look like this:

<p style="text-align: center;"><strong><em><<< Please consider helping us financially with your tax-deductible contribution today >>></em></strong></p>

and

<p style="text-align: center;"><a href="https://www.example.com/join-the-movement?utm_source=website&amp;utm_campaign=my_campaign&amp;utm_medium=article&amp;utm_term=2016&amp;utm_content=my_utm_content
"><img alt="" class="image-blog_body-100" src="http://www.example.com/s3/files/styles/blog_body-100/s3/images/donatenowbuttonnb.jpg?itok=3h8SQb9v" style="width: 250px; height: 75px;" /></a></p>

I can't target the p tag by a specific attribute for either one, so it seems that the best way is to identify the unique content contained inside the block and then work my way back out.

So this works as a starting point to get the text between the arrows:

<<<\s[a-zA-z\s-]+\s>>>

but I'm having trouble getting the tags before that. I obviously need to match three sets of the opening bracket, the tag name, and the closing bracket. After that, I can use a backreference to get the closing tags. I tried this:

^[<(p|em|strong)>]{1,3}<<\s[a-zA-z\s-]+\s>>>

but it's not working. What do I need to change to get those repeating tags (and the attribute text in the p tag)?

Thanks.

UPDATE: Following the suggestion by @b.enoit.be, I'm using PHP DOMDocument. I was able to modify the code that inserts the text I need to remove, and I was able to add an id value to the parent element, so that I could easily identify and remove it, e.g.:

<p id="donateButtonHeading" style="text-align: center;"><strong><em><<< Please consider helping us financially with your tax-deductible contribution today >>></em></strong></p>

getElementById works great to get a DOMElement object, but it looks like it gives me everything in parts, and what I need is either the entire string so I can remove it, or a way to remove that whole element from the document. Here's what I'm trying ($body is the HTML string):

$xmlDoc = new DOMDocument();
$xmlDoc->validateOnParse = true;
$xmlDoc->loadHTML($body);
  foreach (array('donateButtonHeading', 'donateButtonMarkup') as $buttonElementId) {
    $buttonElement = $xmlDoc->getElementById($buttonElementId);

  }

What I'm having trouble figuring out is where to go from here. At this point $buttonElement is a DOMElement, but I need to remove it from $xmlDoc and then call $xmlDoc->saveHTML() to get my HTML output. How do I get from having my DOMElement to removing it from $xmlDoc?

wonder95
  • Character classes are for characters, not words, so `<(p|em|strong)>` are all individual characters, not an element. `<p>` could be found; `<em>` would not, nor would `<strong>`. – chris85 Nov 02 '16 at 20:53
  • Did you consider the fact that regex might not be [the best approach](http://stackoverflow.com/a/1732454/2123530), but [a DOMDocument could know better](http://php.net/manual/en/class.domdocument.php) ? – β.εηοιτ.βε Nov 02 '16 at 21:47
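
The point in the first comment about character classes can be checked directly. The sketch below anchors the question's bracket expression at both ends so that only whole-string matches count; the test strings (including the non-tag "gnp") are illustrative assumptions, not part of the original post.

```php
<?php
// The bracket expression from the question, anchored so only whole-string matches count.
// Inside [...], the characters ( | ) are literals, not grouping or alternation.
$pattern = '/^[<(p|em|strong)>]{1,3}$/';

// Illustrative inputs; "gnp" is not a tag at all.
$candidates = ['<p>', '<em>', '<strong>', 'gnp'];

$results = [];
foreach ($candidates as $candidate) {
    // preg_match() returns 1 when the pattern matches.
    $results[$candidate] = preg_match($pattern, $candidate) === 1;
    printf("%-9s %s\n", $candidate, $results[$candidate] ? 'matches' : 'no match');
}
```

`<p>` and `gnp` both match because each is simply three characters drawn from the set, while `<em>` and `<strong>` are longer than the `{1,3}` quantifier allows.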

2 Answers

0

Use phpQuery or QueryPath:

phpQuery example:

$html = phpQuery::newDocumentHTML(
    '<div>New Test!!!</div><p style="text-align: center;"><strong><em>&lt;&lt;&lt; Please consider helping us financially with your tax-deductible contribution today &gt;&gt;&gt;</em></strong></p><p>Some paragraph</p>'
);
$html->find('p:contains("Please consider helping us financially with your tax-deductible contribution today")')->remove();
return $html->html();

// Second, separate example: remove the paragraph that wraps the donation link.
$html = phpQuery::newDocumentHTML(
    '<p>Entry paragraph</p><p style="text-align: center;"><a href="https://www.example.com/join-the-movement?utm_source=website&amp;utm_campaign=my_campaign&amp;utm_medium=article&amp;utm_term=2016&amp;utm_content=my_utm_content"><img alt="" class="image-blog_body-100" src="http://www.example.com/s3/files/styles/blog_body-100/s3/images/donatenowbuttonnb.jpg?itok=3h8SQb9v" style="width: 250px; height: 75px;" /></a></p><div>This is a test div</div>'
);
$html->find('p a[href*="https://www.example.com/join-the-movement?"]')->parent()->remove();
return $html->html();
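
If adding a library is not an option, roughly the same removal can be sketched with PHP's built-in DOMDocument, which is what the question's update uses. A DOMElement is removed through its parent node. The helper name `removeElementsById`, the sample body, and the placeholder image URL are assumptions for illustration; the two ids are the ones from the question.

```php
<?php
/**
 * Hypothetical helper: strips the elements with the given ids from an HTML fragment.
 */
function removeElementsById(string $html, array $ids): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($ids as $id) {
        $element = $doc->getElementById($id);
        // A node is detached by asking its parent to remove it.
        if ($element !== null && $element->parentNode !== null) {
            $element->parentNode->removeChild($element);
        }
    }

    // loadHTML() wraps a fragment in <html><body>, so serialize only the body's children.
    $result = '';
    $bodyNode = $doc->getElementsByTagName('body')->item(0);
    if ($bodyNode !== null) {
        foreach ($bodyNode->childNodes as $child) {
            $result .= $doc->saveHTML($child);
        }
    }
    return $result;
}

// Illustrative article body (assumed, shortened from the question's markup).
$body = '<p>Intro</p>'
      . '<p id="donateButtonHeading" style="text-align: center;"><strong><em>Please consider helping us</em></strong></p>'
      . '<p id="donateButtonMarkup" style="text-align: center;"><a href="https://www.example.com/join-the-movement"><img alt="" src="http://www.example.com/button.jpg" /></a></p>'
      . '<p>Outro</p>';

$cleaned = removeElementsById($body, ['donateButtonHeading', 'donateButtonMarkup']);
echo $cleaned, "\n";
```

Serializing the body's children rather than calling saveHTML() with no argument keeps the doctype and the html/body wrapper out of the feed. Article text with non-ASCII characters may also need an explicit encoding hint before loadHTML().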
Christos Lytras
-1

I think you'd benefit from a little more freedom in your expression. Try this:

/(?:<(?:p|strong|em)\s*(?:[a-z]+=".+")?>){1,3}<<<\s*[a-z\s-]+\s*>>>(?:<\/(?:p|strong|em)\s*>){1,3}/gi

Note that (?:...) denotes a non-capturing group. If you want to capture the tag name or some other part, remove the ?: and that part of the match will be stored. You might also consider wrapping the whole expression in a capturing group so it can be manipulated further.

https://regex101.com/r/DihfUt/2
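
Since the question is about PHP, here is a hedged sketch of how a pattern like this might be used with preg_replace(). It is an adaptation, not the pattern above verbatim: the g flag is dropped because PHP's preg_* functions do not accept it (preg_replace() already replaces every match), the attribute value uses [^"]* instead of .+ so it cannot run past its closing quote, and the arrows are accepted either literally or as &lt; / &gt; entities. The sample body is an assumption.

```php
<?php
// Adapted pattern (an assumption, see the notes above):
//  1. one to three opening p/strong/em tags, each with optional attributes,
//  2. the arrows and the text between them,
//  3. one to three closing tags.
$pattern = '/(?:<(?:p|strong|em)\b(?:\s+[a-z-]+="[^"]*")*\s*>){1,3}'
         . '(?:<<<|&lt;&lt;&lt;)\s*[a-z\s-]+\s*(?:>>>|&gt;&gt;&gt;)'
         . '(?:<\/(?:p|strong|em)\s*>){1,3}/i';

// Illustrative article body containing the paragraph from the question.
$body = '<p>Intro</p>'
      . '<p style="text-align: center;"><strong><em><<< Please consider helping us financially with your tax-deductible contribution today >>></em></strong></p>'
      . '<p>Outro</p>';

// Replace every match with an empty string.
$cleaned = preg_replace($pattern, '', $body);
echo $cleaned, "\n"; // prints "<p>Intro</p><p>Outro</p>"
```

This only covers the text paragraph; the image paragraph has no distinctive text to key on, which is one reason the DOM-based approaches suggested elsewhere in this thread tend to be more robust.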

Dominic Aquilina