My question is similar to this question asked on Stack Overflow, with one difference.

I have the following stored in a MySQL table:

<p align="justify">First paragraph</p>
<p>Second paragraph</p>
<p>Third paragraph</p>
<div class="item">
<p>Some paragraph here</p>
<p><strong><u>Specs</u>:</strong><br /><br /><strong>Weight:</strong> 10kg<br /><br /><strong>LxWxH:</strong> 5mx1mx40cm</p>
<p align="justify">second last para</p>
<p align="justify">This is the paragraph I am trying to remove with regex.</p>
</div>

I'm trying to remove the last paragraph (tags and content) from every row in the table. The best answer to the linked question suggests the following regex:

preg_replace('~(.*)<p>.*?</p>~', '$1', $html)

The difference from the linked question is that my last paragraph tag may or may not have the attribute align="justify". If the last paragraph has this attribute, the suggested solution instead removes the last paragraph that has no attributes. So I am struggling to find a way to remove the last paragraph regardless of whether it has attributes.

Dr. Atul Tiwari
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Lucas Trzesniewski Jan 02 '16 at 13:57
  • @LucasTrzesniewski Thanks for the link. Although, I didn't understand it completely, I have bookmarked it. – Dr. Atul Tiwari Jan 02 '16 at 14:23
  • The link basically says you should use the right tool for the job. You need an HTML parser/DOM manipulation library here. Using regular expressions is brittle; you can do much better and easier with DOM (or XPath, or CSS selectors). – Lucas Trzesniewski Jan 02 '16 at 14:29
  • @LucasTrzesniewski Thanks for simplification. I will read about HTML parser/DOM manipulation. – Dr. Atul Tiwari Jan 02 '16 at 14:34
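
The comments above recommend a DOM parser over a regex. As a rough illustration of that suggestion (not something posted in the thread), here is a minimal sketch using PHP's bundled DOM extension. The function name removeLastParagraph and the wrapper id fragment-root are hypothetical, chosen only for this example, and the sketch assumes each row holds an HTML fragment rather than a full document.

```php
<?php
// Hypothetical helper: removes the last <p> element (in document order)
// from an HTML fragment, whatever attributes that <p> carries.
function removeLastParagraph(string $html): string
{
    $doc = new DOMDocument();

    // Real-world fragments are rarely perfectly valid, so collect
    // libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common trick to make loadHTML() read the
    // input as UTF-8; the wrapper <div> gives us a known root to export.
    $doc->loadHTML(
        '<?xml encoding="UTF-8"><div id="fragment-root">' . $html . '</div>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // XPath results come back in document order, so the final item
    // is the last paragraph of the fragment.
    $paragraphs = $xpath->query('//p');
    if ($paragraphs->length > 0) {
        $last = $paragraphs->item($paragraphs->length - 1);
        $last->parentNode->removeChild($last);
    }

    // Serialize only the children of the wrapper, not the <html>/<body>
    // scaffolding that loadHTML() adds.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling removeLastParagraph($row) on each row's content would then replace the regex call. Note that the parser may normalize the markup slightly when it writes it back out (for example, how <br /> is serialized), so compare a few rows before updating the table.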

1 Answer

Change the regex to:

preg_replace('~(.*)<p[^>]*>.*</p>\R?~s', '$1', $html)

Regex101 Demo

Regex Breakdown

~           # Opening regex delimiter
  (.*)      # Greedily capture everything up to the last '<p ...>' tag
            # (it first matches to the end of the string, then backtracks)
  <p[^>]*>  # Match an opening '<p>' tag with or without attributes:
            # '[^>]*' matches any characters other than '>'
  .*        # Match any characters up to the closing '</p>' tag
  </p>      # Match the literal '</p>'
  \R?       # Optionally match one trailing line break (\r\n, \r or \n) so it is removed too
~s          # Closing regex delimiter with the 's' (DOTALL) flag
            # (with 's', '.' also matches newlines)
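
To see the pattern in action, here is a small usage sketch. The sample markup is a shortened version of the content in the question, and the variable names ($html, $pattern, $result) are only illustrative.

```php
<?php
// Shortened sample based on the question's content; the last <p>
// carries an attribute, which the original '<p>' pattern missed.
$html = <<<'HTML'
<p>First paragraph</p>
<div class="item">
<p>Some paragraph here</p>
<p align="justify">This is the paragraph to remove.</p>
</div>
HTML;

// Pattern from this answer: the greedy (.*) backtracks to the last
// '<p ...>' tag, and everything from there to the last '</p>' is dropped.
$pattern = '~(.*)<p[^>]*>.*</p>\R?~s';

// '$1' puts back everything that preceded the last paragraph.
$result = preg_replace($pattern, '$1', $html);

echo $result;
```

One caveat: '<p[^>]*>' also matches other tags whose names start with "p", such as '<pre>'. If the stored content may contain those, '<p\b[^>]*>' is a possible tightening; that variation is my own suggestion and is not covered by the demo.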
Giuseppe Ricupero