Regular Expression Lookbehind doesn't work as expected

Question

I have a string in .NET:

<p class='p1'>Para 1</p><p>Para 2</p><p class="p2">Para 3</p><p>Para 4</p>

Now I want to get only the text inside the p tags (Para 1, Para 2, Para 3, Para 4).

I used the following regular expression, but it doesn't give me the expected result.

(?<=<p.*>).*?(?=</p>)

If I use (?<=<p>).*?(?=</p>), it gives Para 2 and Para 4, the two p tags that don't have a class attribute.

I'd like to know what's wrong with (?<=<p.*>).*?(?=</p>).

Looks to me like you're parsing HTML with regex. Please read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) (Best viewed in browser which supports unicode :)) — El Ronnoco, Nov 01 '11 at 09:59

score 5 · Accepted Answer · edited Nov 27 '17 at 01:04

Let's illustrate this using RegexBuddy:

RegexBuddy Screenshot

Your regex matches more than you think - the dot matches any character, so it doesn't care about tag boundaries.

What it is actually doing:

(?<=<p.*>): Assert that <p occurs somewhere in the string before the current position, followed by any number of characters and then a > directly before the current position.
.*?: Match any number of characters...
(?=</p>): ...until the next occurrence of </p>.
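
To see the over-matching concretely, here is a minimal C# sketch that runs the question's pattern against the question's sample string. The variable names (html, found) are illustrative only, and the top-level statement style assumes a recent C# compiler; older projects would wrap this in a Main method.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string html = "<p class='p1'>Para 1</p><p>Para 2</p><p class=\"p2\">Para 3</p><p>Para 4</p>";

// The original pattern: the lookbehind only needs "<p" somewhere earlier
// and a ">" directly before the match start, so any ">" can start a match.
string[] found = Regex.Matches(html, @"(?<=<p.*>).*?(?=</p>)")
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToArray();

foreach (string value in found)
    Console.WriteLine(value);
// Prints:
// Para 1
// <p>Para 2
// <p class="p2">Para 3
// <p>Para 4
```

Only the first match is clean. Every later match starts right after a closing </p>, because that > satisfies the lookbehind just as well as the > of an opening tag.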

Your question is a bit unclear, but if your plan is to find text within <p> tags regardless of whether they contain any attributes, you shouldn't be using regular expressions anyway but a DOM parser, for example the HTML Agility Pack.

That said, if you insist on a regex, try

(?<=<p[^<>]*>)(?:(?!</p>).)*

Another screenshot

Explanation:

(?<=<p[^<>]*>)  # Assert position right after a p tag
(?:(?!</p>).)*  # Match any number of characters until the next </p>
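
A minimal C# sketch of the pattern above, run against the sample string from the question. The variable names are illustrative, and the top-level statement style assumes a recent C# compiler.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string html = "<p class='p1'>Para 1</p><p>Para 2</p><p class=\"p2\">Para 3</p><p>Para 4</p>";

// [^<>]* cannot cross a tag boundary, so the lookbehind only succeeds
// directly after an opening tag that starts with "<p".
// (?:(?!</p>).)* then consumes characters until "</p>" comes next.
string[] found = Regex.Matches(html, @"(?<=<p[^<>]*>)(?:(?!</p>).)*")
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToArray();

foreach (string value in found)
    Console.WriteLine(value);
// Prints:
// Para 1
// Para 2
// Para 3
// Para 4
```

Note that <p[^<>]* also accepts other tag names beginning with p, such as <pre>. Adding a word boundary (<p\b[^<>]*>) is one way to tighten it, though that goes beyond the pattern given here.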

edited Nov 27 '17 at 01:04

carla


answered Nov 01 '11 at 09:56

Tim Pietzcker


I think his requirement is to match and find the text inside the paragraph tag regardless of whether they contain a class or any other attribute. Either way, he's better off using something like HTML Agility pack :) – Ranhiru Jude Cooray Nov 01 '11 at 10:01
Thanks. I've changed the code to (?<=).*?(?=) by adding ? in the lookbehind making it optional. But it seems like it doesn't work yet. – lil master Nov 01 '11 at 10:03
@lilmaster: What are you *actually* trying to achieve? The lookbehind is useless in its current form anyway, it doesn't matter if you make it optional. I've just added an explanation why this is so. – Tim Pietzcker Nov 01 '11 at 10:05
@TimPietzcker I want to get the text between each pair of p tags. That's my goal. If I use (?<=<p>).*?(?=</p>), it returns the text between p tags without a class attribute. But in my case, I want to get the text between p tags with or without a class attribute. – lil master Nov 01 '11 at 10:10
@TimPietzcker It works great. Thanks for your help. I've to learn more. – lil master Nov 01 '11 at 10:14

score 1 · Answer 2 · answered Nov 01 '11 at 09:57

Have you tried using the following expression?

<p[\s\S]*?>(?<text_inside_p>[\s\S]*?)</p>

The group named text_inside_p will contain the desired text.
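
As a rough sketch, reading that named group in C# could look like this. The sample string is the one from the question; the variable names (pattern, texts) are illustrative, and the top-level statement style assumes a recent C# compiler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string html = "<p class='p1'>Para 1</p><p>Para 2</p><p class=\"p2\">Para 3</p><p>Para 4</p>";

// The whole element is matched; only the named group holds the inner text.
var pattern = new Regex(@"<p[\s\S]*?>(?<text_inside_p>[\s\S]*?)</p>");

List<string> texts = pattern.Matches(html)
                            .Cast<Match>()
                            .Select(m => m.Groups["text_inside_p"].Value)
                            .ToList();

Console.WriteLine(string.Join(", ", texts));
// Prints: Para 1, Para 2, Para 3, Para 4
```

Unlike the lookbehind version, each match here consumes the tags themselves, so the result has to be read from the group rather than from the match value.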

Shekhar


out of interest does [\s\S] omit things that . would include? – Chris Nov 01 '11 at 10:03
`[\s\S]` matches any character - whitespace | non-whitespace. So will include linebreaks etc... Though in C# regex in multiline mode I think `.` will match newline anyway... Though I may be wrong. – El Ronnoco Nov 01 '11 at 10:05
@ElRonnoco: you're right about .NET, except it's `Singleline` mode that beefs up the dot, not `Multiline`. – Alan Moore Nov 01 '11 at 13:54
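
The difference discussed in these comments can be checked with a small C# sketch. The test string and variable names are made up for illustration, and the top-level statement style assumes a recent C# compiler.

```csharp
using System;
using System.Text.RegularExpressions;

// A two-line string: "a", a newline, then "b".
string twoLines = "a\nb";

// By default the dot does not match a newline.
bool plain = Regex.IsMatch(twoLines, "a.b");

// Multiline only changes how ^ and $ behave; the dot is unaffected.
bool multiline = Regex.IsMatch(twoLines, "a.b", RegexOptions.Multiline);

// Singleline is the option that lets the dot match a newline.
bool singleline = Regex.IsMatch(twoLines, "a.b", RegexOptions.Singleline);

// [\s\S] matches any character, newline included, whatever the options.
bool charClass = Regex.IsMatch(twoLines, @"a[\s\S]b");

Console.WriteLine($"{plain} {multiline} {singleline} {charClass}");
// Prints: False False True True
```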
