I've written a regular expression to get the first two paragraphs from a database clob which stores its content in HTML formatting.

I've checked it with the online RegEx builders/checkers here and here, and in both it seems to do what I want. (I've altered the RegEx slightly since then to handle the newline formatting, which I only found afterwards.)

However, when I use this in my PHP, it doesn't seem to want to get just the group I'm after, and instead matches everything.

Here is my preg_replace line:

$description = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', "$2", $description);

And here is my test content, in the same format as the content I am getting:

<p> 
    Paragraph 1</p> 
<p> 
    Paragraph 2</p> 
<p> 
    Paragraph 3</p>

I've had a look at this SO Post which didn't help.

Any Ideas?

EDIT

As pointed out in one of the comments, you cannot regex HTML in PHP (I don't know why, and I'm not really bothered by that).

Now I'm opening the option for getting it in PL/SQL as well.

select 
    DBMS_LOB.substr(description, 32000, 1) /* How do I make this into a regular expression? */
from
    blog_posts
Zach Ross-Clyne
  • There's another SO post on this issue that's pretty well known, this one: http://stackoverflow.com/q/1732348/521598 – mishu Jul 14 '15 at 08:32
  • Why wouldn't a DOM parser work for you? Regex shouldn't be used to process HTML. – npinti Jul 14 '15 at 08:32
  • I only want to get the first 2 paragraphs from a CLOB in the database, but I don't want to change that CLOB because the following page will include all of it; it's for a blog post preview. The text is coming back as pure HTML and I figured this should be done prior to loading it into the page. – Zach Ross-Clyne Jul 14 '15 at 08:35
  • @ZachRoss-Clyne I don't speak ;) php, but isn't the regex a mix between JavaScript and php? Shouldn't the / at the beginning and the end be removed? Or if JS and php work the same, the ' should be removed. – SamWhan Jul 14 '15 at 08:51
  • Weirdly, PHP regexes are written inside both `'`s and `/`s – Zach Ross-Clyne Jul 14 '15 at 09:00

2 Answers

Your input contains newlines, therefore you have to add the s modifier:

/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s

Otherwise, . does not match newline characters, so the pattern as a whole fails to match and preg_replace returns the input unchanged.
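
To see the difference, here is a minimal sketch that runs the pattern from the question with and without the s modifier. The test string is an assumption: it is rebuilt by hand from the sample in the question, so the exact whitespace in the real CLOB may differ.

```php
<?php
// Test content rebuilt by hand from the sample in the question
// (assumption: the real CLOB uses similar whitespace and newlines).
$description = "<p> \n    Paragraph 1</p> \n<p> \n    Paragraph 2</p> \n<p> \n    Paragraph 3</p>";

// The pattern from the question, without any modifier.
$pattern = '/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/';

// Without s, "." cannot cross a newline, so nothing matches and
// preg_replace() hands back the subject unchanged.
$withoutS = preg_replace($pattern, '$2', $description);

// With s (PCRE_DOTALL), "." also matches newlines, so group 2
// captures the first two paragraphs and the rest is dropped.
$withS = preg_replace($pattern . 's', '$2', $description);

var_dump($withoutS === $description);
echo $withS;
```

An unchanged return value is what makes it look as if the expression "matches everything".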

georg

You could take a look at the PHP Simple HTML DOM Parser. Going by its manual, you could do something like this:

$html = str_get_html('your html string');
foreach($html->find('p') as $element)   //This should get all the paragraph elements in your string.
       echo $element->plaintext. '<br>';
npinti
  • It's cool, I've changed it to do the regular expression in the PL/SQL select statement rather than in the PHP, as per the point @mishu made – Zach Ross-Clyne Jul 14 '15 at 08:53
  • @ZachRoss-Clyne: It is still not recommended to process HTML with regular expressions, regardless of the level at which this is done. – npinti Jul 14 '15 at 09:03
  • I don't fully understand why. If I need the first 2 paragraphs of a string that is formatted in HTML there is no other way to do this. I have tried simply reading the text but that doesn't work because if I am half way through a paragraph when the read cuts out then I am left with a broken page. – Zach Ross-Clyne Jul 14 '15 at 09:26
  • Changed my route to use this, seems pretty respectable. – Zach Ross-Clyne Jul 14 '15 at 09:33
  • @ZachRoss-Clyne: To understand this, you will need to look at the first comment under your answer. HTML is not a regular language; it does not have a strict format, and things can be missing while the HTML still renders. Although you *could* make an expression which parses an HTML segment, it will not be robust enough and will most likely break if slightly different, yet still valid, HTML is used. – npinti Jul 14 '15 at 09:52
  • Okay, given that I know the exact format that is being provided to the function, I could theoretically use it and it would work every time. – Zach Ross-Clyne Jul 14 '15 at 10:11
  • @ZachRoss-Clyne: I understand what you mean, but then you would be parsing a subset of what HTML could provide. If in time the specification changes, it would still be considered HTML, but your HTML parsing expression would, in essence, be broken. – npinti Jul 14 '15 at 10:21
  • @npinti: as an aside `simplehtmldom` code is full of regex. That's why I avoid it in favor of `DOMDocument`. – Casimir et Hippolyte Jul 14 '15 at 10:22
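
For reference, the `DOMDocument` route mentioned in the last comment could look roughly like the sketch below. It is only an illustration: the helper name `firstParagraphs` and the wrapping `<div>` are assumptions made for this example, not something taken from the thread, and content with non-ASCII characters may need extra encoding handling.

```php
<?php
// Hypothetical helper (not from the thread): return the first $limit
// <p> elements of an HTML fragment as an HTML string.
function firstParagraphs(string $html, int $limit = 2): string
{
    // Collect libxml warnings instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment in a <div> so the parser sees a single container.
    $doc->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = '';
    $count = 0;
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        if ($count >= $limit) {
            break;
        }
        // saveHTML() with a node argument serialises just that node.
        $out .= $doc->saveHTML($paragraph);
        $count++;
    }
    return $out;
}

// Usage with content shaped like the sample in the question.
$description = "<p> \n    Paragraph 1</p> \n<p> \n    Paragraph 2</p> \n<p> \n    Paragraph 3</p>";
$preview = firstParagraphs($description);
echo $preview;
```

Because the parser handles attributes, whitespace, and line breaks itself, this version does not depend on the exact layout of the stored HTML.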