Select the first paragraph tag not contained within another tag using RegEx (Perl-style)

Question

I have this block of html:

<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>

I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.

This is what I have so far, but it selects the contents of the nested paragraph inside the div instead.

/<p>(.+?)<\/p>/is

Thanks!

EDIT

Unfortunately, I don't have the luxury of a DOM Parser.

I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text, and HTML was the simplest case I could think of to explain it. My actual data looks more like this (but with a lot more data, and minified):

#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]

I have to reformat a certain column in certain rows, across a ton of rows similar to these. Help with my original question would help with the actual project.

Joke link is relevant for once! (not as duplicate, mind you) — mario, Dec 13 '11 at 22:31
I'm assuming that the block you posted is contained in some other element? — Levi Morrison, Dec 13 '11 at 22:34
Asking how to parse HTML with Regex? You'll get eaten alive for that... seriously though, [DOMDocument](http://php.net/manual/en/class.domdocument.php) will do you much better here. — DaveRandom, Dec 13 '11 at 22:38
Will nested paragraphs always be indented? Will paragraphs only ever span one line? Will there only ever be a paragraph on that line? If so, just look for the opening tag at the start of the line and match that whole line. `/^<p>.+/m` If that is not sufficient, please **detail your requirements fully**. – salathe, Dec 13 '11 at 23:01
Added clarification. I have some pretty messy data to dig through so non-joke answers are appreciated. — Workman, Dec 13 '11 at 23:30
`preg_match('_(^[^<>]*|\w+>\s*)<p>(.+?)<\/p>_is'` might work if the HTML block structure is always similar to your shown example. The result is in `[2]`, and some prefix will remain, as you cannot use an assertion for that. Otherwise you will need a recursive `(?R)` regex... (Add a bounty if you need that.) -- Using [QueryPath](http://stackoverflow.com/questions/tagged/QueryPath) would be so much simpler `htmlqp($html)->find("p")->not("div p");` or [SimpleHtmlDom](http://stackoverflow.com/questions/tagged/SimpleHtmlDom) for older PHP servers without DOM support. – mario, Dec 13 '11 at 23:48

score 2 · Accepted Answer · answered Dec 13 '11 at 22:39

Your regex won't work. Even if you had only non-nested paragraphs, your capturing parentheses would match "First, non-nested ... Last paragraph."

Try:

<([^>]+)>([^<]*(?:<(?!/?\1)[^<]*)*)</\1>

and grab \2 if \1 is p.
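
A minimal PHP sketch of this approach, assuming tags carry no attributes and a tag never nests inside another tag of the same name. The pattern is written so that the text before the closing tag is consumed and the closing tag carries its slash; `~` is used as the delimiter so the slashes need no escaping. The variable names are illustrative only.

```php
<?php
// Sample input from the question.
$html = <<<HTML
<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// Group 1: tag name. Group 2: everything up to the matching close tag,
// allowing any "<" that does not start <name or </name.
$pattern = '~<([^>]+)>([^<]*(?:<(?!/?\1)[^<]*)*)</\1>~';

$firstParagraph = null;
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // The <div> match swallows the nested <p>, so it is skipped here.
        if ($m[1] === 'p') {
            $firstParagraph = $m[2];
            break;
        }
    }
}

echo $firstParagraph, PHP_EOL;
```

Because the whole div is consumed as one match, the nested paragraph is never offered as a match of its own.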

But an HTML parser would do a better job of that imho.

fge

Try to run it, it definitely selects "First, nested paragraph." -- "First, non-nested ... Last paragraph." will be selected if you take the '?' out of what I have – Workman Dec 13 '11 at 22:45
Oops, yes, I didn't see the `?`. But I hate lazy quantifiers, maybe I just didn't see it in my mind ;) – fge Dec 13 '11 at 22:50
Thanks for your relevant answer, I added some clarification to give a more clear idea of why I used a HTML example for this question. I'm working with your example and making progress. Thanks. – Workman Dec 13 '11 at 23:30
Now that I see your data though, I wonder why you first try and match lines which *do not* interest you? Divide and conquer... If your input is really line-oriented then just matching against `^#` will match quite a few lines you don't want, if I understand correctly? – fge Dec 13 '11 at 23:32
They would if they were actually on separate lines :/ A lot of this mess is minified and some of these files are single lines. It's a mess. Though I may have to perform several steps (As you say, Divide and conquer) just to make enough sense of some of these files. Instead of introducing my whole parsing task to SO, I just introduced the example. ]s and ending #s can essentially be considered line breaks. – Workman Dec 13 '11 at 23:36

score 2 · Answer 2 · answered Dec 14 '11 at 00:25

How about something like this?

<p>([^<>]+)<\/p>(?=(<[^\/]|$))

It uses a look-ahead to make sure the paragraph is not immediately followed by a closing tag (which would mean it sits inside another element), while still allowing it to be at the end of the string. There is probably a better way to match the paragraph's contents, but you need to avoid being too greedy (a .+? will not suffice).
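
A rough PHP sketch of the look-ahead idea, run against the question's sample. One assumption is added here: `\s*` inside the look-ahead, so that line breaks between tags are tolerated (for fully minified input it is not needed). `~` is used as the delimiter.

```php
<?php
// Sample input from the question (not minified, so whitespace follows each tag).
$html = "<div>\n  <p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n"
      . "<p>Second paragraph.</p>\n"
      . "<p>Last paragraph.</p>";

// A paragraph counts as non-nested when the next thing after it
// (ignoring whitespace) is an opening tag or the end of the string.
$pattern = '~<p>([^<>]+)</p>(?=\s*(?:<[^/]|$))~';

$firstParagraph = preg_match($pattern, $html, $m) ? $m[1] : null;

echo $firstParagraph, PHP_EOL;
```

This only inspects what follows the paragraph, so a nested paragraph that is followed by another opening tag inside the same container would still slip through.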

score 1 · Answer 3 · answered Dec 13 '11 at 23:58

Use a ~~two~~ three step process. First, pray that everything is well formed. Second, ~~First,~~ remove everything that is nested.

s{<div>.*?</div>}{}gs;        # HTML example (/s lets . match newlines)
s/#.*?#//gs;                  # 2nd example

Then get your result. Everything that is left is now not nested.

($result) = m{<p>(.*?)</p>}s; # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]}s;    # 2nd example

(this is Perl. Don't know how different it would look in PHP).
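
For comparison, a PHP sketch of the same remove-then-match steps might look like the following. It assumes well-formed input, no div inside another div, and no stray `#` inside the item data; the sample strings are shortened from the question.

```php
<?php
// HTML example: drop every <div>...</div> block first, then take the first <p>.
$html = "<div>\n  <p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n"
      . "<p>Second paragraph.</p>";

$flat = preg_replace('~<div>.*?</div>~s', '', $html);   // "s" lets "." span line breaks
$firstParagraph = preg_match('~<p>(.*?)</p>~s', $flat, $m) ? $m[1] : null;

// Second example: drop the #...# wrapped records, then take the first [...] record.
$data = '#[BILLINGCODE|12345|11|15|2001|15|26|50]#'
      . '[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]'
      . '#[{{Additional Details }}]#'
      . '[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]';

$flatData = preg_replace('~#.*?#~s', '', $data);
$firstRecord = preg_match('~\[(.*?)\]~s', $flatData, $m) ? $m[1] : null;

echo $firstParagraph, PHP_EOL;   // First, non-nested paragraph.
echo $firstRecord, PHP_EOL;      // ITEM1|{{Escaped Description}}|1|1|4031|NONE|15
```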

score 1 · Answer 4 · edited May 23 '17 at 12:03

"You shouldn't use regex to parse HTML."

It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.

To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.

$myHtml = <<<MARKUP
   <html>
       <head>
            <title>something</title></head>
       <body>
            <div>
                <p>not valid</p>
            </div>
            <p>is valid</p>
            <p>is not valid</p>
            <p>is not valid either</p>
            <div>
                <p>definitely not valid</p>
            </div>
       </body>
   </html>
MARKUP;

$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));

var_dump($yourNode);

// output '<p>is valid</p>'

score 0 · Answer 5 · edited May 23 '17 at 12:03

You might want to have a look at this post about parsing HTML with Regex.

Because HTML is not a regular language (and Regular Expressions are), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser; it'll get the job done considerably more smoothly than trying to hack together some regex.

answered Dec 13 '11 at 22:42

Mr. Llama

(Not to deny anyone the entertainment. But even the non-joke answers there are mostly wrong. So please post as comment or community wiki when tag-off-topic). Regular expressions, despite the name, aren't regular. They are context-free. It's just that the effort is [prohibitive](http://stackoverflow.com/a/4234491/345031) for making it reliable for arbitrary html. – mario Dec 13 '11 at 22:51
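
As a small illustration of the recursion mentioned in the comments, PCRE's `(?R)` can match balanced `<div>` blocks. This is only a sketch under strong assumptions (attribute-free `<div>` tags, well-formed nesting, a made-up input string), not something to rely on for arbitrary HTML.

```php
<?php
// Hypothetical input with a <div> nested inside another <div>.
$html = '<div><div><p>a</p></div><p>b</p></div><p>c</p>';

// (?R) re-enters the whole pattern, so each inner <div>...</div> is matched
// as a balanced unit. [^<]++ is possessive to avoid needless backtracking.
$balancedDiv = '~<div>(?:[^<]++|<(?!/?div>)|(?R))*</div>~';

// Removing every balanced block leaves only the top-level content.
$topLevel = preg_replace($balancedDiv, '', $html);

echo $topLevel, PHP_EOL;
```

A lazy `<div>.*?</div>` would stop at the first `</div>` and leave the outer block half-removed, which is the case the recursive form is meant to handle.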
