RegEx: Matching Pattern within Pattern - I think I need to use Positive Lookbehinds?

Question

I'm trying to use RegEx to find a pattern within a pattern. Specifically what I want to do is capture a URL into a reference and search within that for everything that comes after the last = sign and capture that as well.

So given this string

<a href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff" style="color: #365EBF:">stuff</a>

I would initially find

href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff"

Using this RegEx: href="(https?[^"]*)"

From there I could search within the captured group for the string I'm actually after, EM_CMC21892_LC_stuff, with this: =[^"=]*$
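For illustration, the two-step approach described above could be sketched as follows. Python is an assumption here (the question concerns RealStudio), and the function name `two_step_last_value` is hypothetical. The second pattern adds a capture group so the leading = is left out of the result.

```python
import re

SAMPLE = ('<a href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892'
          '&s_ev10=EM_CMC21892_LC_stuff" style="color: #365EBF:">stuff</a>')


def two_step_last_value(html):
    """Return (url, text after the last '=') or None if nothing matches."""
    # Step 1: capture the URL inside the href attribute.
    url_match = re.search(r'href="(https?[^"]*)"', html)
    if url_match is None:
        return None
    url = url_match.group(1)

    # Step 2: search the captured URL for everything after the last "=".
    tail_match = re.search(r'=([^"=]*)$', url)
    if tail_match is None:
        return None
    return url, tail_match.group(1)


url, last_value = two_step_last_value(SAMPLE)
print(last_value)  # EM_CMC21892_LC_stuff
```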

I am having no success though when I try to combine the two to accomplish it in one RegEx.

Any thoughts?

Why do you want to use regular expressions here? Doesn't the language you are using have an HTML parsing library or a URL parsing library? — Mark Byers, Feb 01 '11 at 00:12
Well, I'm trying to get better using Regular Expressions so I wanted to see if it's possible. The other reasons are I'm not sure if the language (RealStudio) has a parsing library that will handle. This is an update to something I've worked on in the past and I do a bunch of strange find/replace based on other factors and the found patterns and at that time RegEx was my best option. — dscl, Feb 01 '11 at 00:24
Yes, certainly it is possible. All things are possible, but not all are expedient. — tchrist, Feb 01 '11 at 00:42
To show what is simultaneously *possible* yet in no fashion *expedient*, read [this testament against (mortals’:) using regexes on HTML](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326). If you grok that example perfectly well, then surely such simplistic tasks as [parsing email addresses per RFC 5322](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579) will be a breeze. **HINT:** There is a lot more to pattern matching than most people are apt to learn in a day. — tchrist, Feb 01 '11 at 00:46

score 0 · Accepted Answer · answered Feb 01 '11 at 00:26

He's right, using regexes to parse HTML is just asking for trouble.

That said, try: href="http[^"]+=([^"]+?)"
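A minimal sketch of how this pattern behaves, written in Python as an assumption since the question does not use it; the names `LAST_VALUE` and `last_param_value` are hypothetical. The greedy `[^"]+` first runs up to the closing quote and then backtracks to the last = it can use, so the capture group ends up holding only the final parameter value.

```python
import re

SAMPLE = ('<a href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892'
          '&s_ev10=EM_CMC21892_LC_stuff" style="color: #365EBF:">stuff</a>')

# Greedy [^"]+ consumes the URL, then backtracks to the last "=".
# Group 1 is whatever sits between that "=" and the closing quote.
LAST_VALUE = re.compile(r'href="http[^"]+=([^"]+?)"')


def last_param_value(html):
    """Return the text after the last '=' in the first matching href, or None."""
    match = LAST_VALUE.search(html)
    return match.group(1) if match else None


print(last_param_value(SAMPLE))  # EM_CMC21892_LC_stuff
```

One caveat worth knowing: because the group requires at least one character, a URL whose last parameter is empty (for example `?a=1&b=`) makes the engine backtrack to an earlier =, and the group then captures `1&b=` rather than an empty string.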

albert

No, not asking for trouble: asking for a *serious* education. ☺ – tchrist Feb 01 '11 at 00:49

score 0 · Answer 2 · answered Feb 01 '11 at 00:26

I agree with Mark Byers's comment about using existing HTML/URL parsing functions instead of regex (though you didn't specify which language you are using, so we can't really help with that...)

However, if you insist on doing it the regex way, here is a pattern:

/href="([^"]*=([^"]*))"/

Edit to add: here is what the result would look like. I wasn't sure whether you still wanted to capture the full URL or just the last parameter value, so this pattern captures both:

Array
(
    [0] => Array
        (
            [0] => href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff"
        )

    [1] => Array
        (
            [0] => http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff
        )

    [2] => Array
        (
            [0] => EM_CMC21892_LC_stuff
        )

)
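The array dump above looks like PHP output; as a rough equivalent, here is the same pattern in Python, which is an assumption on my part, with the hypothetical names `BOTH` and `url_and_last_value`. Group 1 holds the full URL and group 2 holds the text after the last =, matching entries [1] and [2] above.

```python
import re

SAMPLE = ('<a href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892'
          '&s_ev10=EM_CMC21892_LC_stuff" style="color: #365EBF:">stuff</a>')

# Outer group: the whole attribute value. Inner group: text after the last "=".
BOTH = re.compile(r'href="([^"]*=([^"]*))"')


def url_and_last_value(html):
    """Return (full_url, last_value) for the first matching href, or None."""
    match = BOTH.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)


full_url, last_value = url_and_last_value(SAMPLE)
print(full_url)    # http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff
print(last_value)  # EM_CMC21892_LC_stuff
```

Note that this pattern does not require the value to start with http, so it matches any href that contains an =, and an href with no = at all is skipped.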
