Regex is Capturing Everything Instead of Just the "Wildcard" in Parenthesis - How to Fix it?

Question

I have the following string:

<p class=MsoNormal><b>Customer Email: <o:p></o:p></b></p></td><td width=""75%"" valign=top style='width:75.0%;border:none;padding:0in 0in 11.25pt 0in'><p class=MsoNormal><a href=""mailto:username@gmail.com""">

I'm trying to capture just the email address (username@gmail.com) from the above string using regex. I'm using the following regex:

Customer Email.*?mailto:(.*?)"

Testing the above regex in Notepad++, instead of it just matching the email address, it is matching everything from (and including) "Customer Email" all the way to the " just after the email address.

I need the regex to match just the email address, and it has to work in Notepad++.

Any ideas on why it is matching everything instead of just what it should be matching in the (.*?)?

Not all regular expression engines are the same. Tag with the applicable environment(s). In this case, search for "look behind" or "capture group". — , Mar 19 '13 at 02:14
What do you want to do with the email address in Notepad++? If you are doing a search and replace, you can use `\1` to refer to what is captured by `(.*?)` — azhrei, Mar 19 '13 at 02:15
It's not that I actually want to do it within Notepad++; I want to do it within uBot, and uBot and Notepad++ return the same results when using the same RegEx. In other words, I know nothing about "regex engines", but I do know that the RegEx works the same way in Notepad++ and uBot, and I thought it would be too confusing to say I need a RegEx that works in uBot since nobody would know what I was talking about. — Learning, Mar 19 '13 at 02:21
@Learning Since you're using uBot, is a JavaScript solution also OK? — Benjamin Gruenbaum, Mar 19 '13 at 02:23
Yes, it is. In fact I'm much more comfortable with JavaScript...unless we're talking regex & javascript, in which case yes, that will work, but I'm more or less clueless. — Learning, Mar 19 '13 at 02:24
Even better, is a C# solution acceptable? Since you're using vb.net it'll be very easy to port — Benjamin Gruenbaum, Mar 19 '13 at 02:25
That, I can't comment on...I know I can use javascript within uBot as I do it all the time...I couldn't write "hello world" in C# if you paid me (and didn't let me use Google). — Learning, Mar 19 '13 at 02:30
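
A minimal JavaScript sketch of the search-and-replace idea from azhrei's comment above. It assumes a shortened copy of the HTML from the question and pads the pattern with `.*` on both sides so that the whole line is replaced by the captured group. JavaScript writes the backreference as `$1` in the replacement string where the comment uses `\1`.

```javascript
// Shortened version of the HTML from the question (an assumption for brevity).
const inputLine = '<b>Customer Email: </b></p><p class=MsoNormal><a href="mailto:username@gmail.com">';

// Match the whole line, but capture only the text between "mailto:" and the next quote.
const wholeLinePattern = /^.*Customer Email.*?mailto:(.*?)".*$/;

// Replace the entire match with the contents of capture group 1.
const extractedEmail = inputLine.replace(wholeLinePattern, "$1");

console.log(extractedEmail);
```

The pattern still matches everything; the replacement step is what throws away all but the captured part.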

score 2 · Accepted Answer · edited May 23 '17 at 12:21

Since you're able to use JavaScript, I would suggest the following solution. I think it is better than regular expressions, which should NOT be used to parse HTML anyway.

Here is how I would do it in JavaScript:

var a = document.createElement("div"); //create a wrapper
a.innerHTML = '<p class=MsoNormal><b>Customer Email: <o:p></o:p></b></p></td><td width="\"75%\"" valign=top style=\'width:75.0%;border:none;padding:0in 0in 11.25pt 0in\'><p class=MsoNormal><a href="mailto:username@gmail.com">'; //your data
var ps = a.querySelectorAll("p"); //get all the p tags
var emails = [];
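var sawLabel = false; //becomes true once a p tag containing "Customer Email" is seen
[].forEach.call(ps, function (pTag) { //for each p tag
    if (pTag.textContent.indexOf("Customer Email") !== -1) {
        sawLabel = true; //in the sample, the label and the link are in separate p tags
    }
    if (!sawLabel) {
        return; //skip p tags that come before the "Customer Email" label
    }
    var as = pTag.querySelectorAll("a"); //get the links from it
    [].forEach.call(as, function (aTag) {
        if (aTag.href && aTag.href.substring(0, 7) === "mailto:") { //for mailto links
            emails.push(aTag.href.substring(7)); //got a match, add the email address
        }
    });
});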
console.log(emails); //emails now contains an array of your extracted emails

See this question on why it is a better approach than using Regular Expressions.

With regular expressions, this is usually done with a lookbehind:

(?<=Customer Email.*?mailto:)(.*?)(?=")
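
A short JavaScript sketch of how the lookaround version behaves, assuming a shortened copy of the HTML from the question and an engine that accepts variable-length lookbehind (JavaScript engines implementing ES2018 do; some others, such as PCRE, only accept fixed-length lookbehind, so this pattern may be rejected there). Because lookbehind and lookahead do not consume text, the whole match is just the email address.

```javascript
// Shortened version of the HTML from the question (an assumption for brevity).
const sampleHtml = '<b>Customer Email: </b></p><p class=MsoNormal><a href="mailto:username@gmail.com">';

// The lookbehind requires "Customer Email ... mailto:" before the match,
// and the lookahead requires a double quote right after it.
const lookaroundPattern = /(?<=Customer Email.*?mailto:)(.*?)(?=")/;

const lookaroundResult = lookaroundPattern.exec(sampleHtml);

// Index 0 is the whole match; with lookarounds it contains only the address.
const matchedEmail = lookaroundResult[0];

console.log(matchedEmail);
```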

answered Mar 19 '13 at 02:08

Benjamin Gruenbaum

When I use that regex within notepad++ on the string given in my original question, it matches the following: :
– Learning Mar 19 '13 at 02:15
See updated answer suggesting a different approach assuming JavaScript is legal as you've stated in the comments, you should edit your question to reflect that – Benjamin Gruenbaum Mar 19 '13 at 02:42
I actually ended up selecting your answer as the correct one anyway, as the "lookbehind" solution you suggested turned out to work very well, with some minor modifications. The following RegEx ended up working: (?<=Customer Email.*?mailto:)(.*?)(?=") – Learning Mar 19 '13 at 02:59
I'm glad I could help, I've updated the answer to the RegExp you ended up using. Note I still suggest the HTML parsing approach in my JavaScript answer. I scrape a lot of pages and it has consistently worked out better than using RegExp for HTML parsing. – Benjamin Gruenbaum Mar 19 '13 at 03:01
@BenjaminGruenbaum that regex is invalid, check out my answer for details – CSᵠ Mar 19 '13 at 03:03
@kaᵠ That RegExp is what the OP ended up using, so I updated to it. It is apparently _not_ invalid in the dialect OP is using. – Benjamin Gruenbaum Mar 19 '13 at 10:31

score 0 · Answer 2 · answered Mar 19 '13 at 02:09

What it matches and what it captures are entirely different things. It will only capture what's in the capturing group. Try actually using it in code.
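
To illustrate the difference in code, here is a minimal JavaScript sketch. It uses the pattern from the question on a shortened copy of the HTML (shortened here for brevity, which is an assumption and not the exact input). The whole match and the captured group are separate entries in the result.

```javascript
// Shortened version of the HTML from the question (an assumption for brevity).
const html = '<b>Customer Email: </b></p><p class=MsoNormal><a href="mailto:username@gmail.com">';

// Same pattern as in the question.
const pattern = /Customer Email.*?mailto:(.*?)"/;

const result = pattern.exec(html);

// result[0] is the whole match: everything the pattern consumed,
// from "Customer Email" through the closing quote.
const wholeMatch = result[0];

// result[1] is only what the first capturing group (.*?) captured.
const captured = result[1];

console.log(wholeMatch);
console.log(captured);
```

A tool that only highlights the whole match, as Notepad++ does in its search, shows the equivalent of `result[0]`, which is why everything appears to be matched.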

hobbs

I'm a little hazy on what you mean...so how do I actually match just the email address? I'm basing this regex in my original question off of the answer I found here: http://stackoverflow.com/questions/8652039/regex-match-between – Learning Mar 19 '13 at 02:14
@Learning: In that [other question](http://stackoverflow.com/a/8652117/20938), the regex *matches* (or *consumes*) the whole `` element (including the tags), but it also *captures* the stuff between the tags. In most regex-powered tools you would be able to grab just the captured portion instead of the whole match, but that doesn't seem to be the case with uBot. – Alan Moore Mar 19 '13 at 08:06

score -1 · Answer 3 · answered Mar 19 '13 at 02:10

The * is a wild character, so it is going to match everything. ? is one wild character. So that's why it is matching everything.

mysteriousboy

You're mixing RegExp up with wildcards. * is zero or more, ? is zero or once – Benjamin Gruenbaum Mar 19 '13 at 02:11
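
A small JavaScript sketch of the quantifier behaviour described in the comment above. The sample strings are made up for illustration.

```javascript
// * means "zero or more of the preceding item", not "any text".
const starPattern = /ab*c/;
const starResults = [
  starPattern.test("ac"),    // zero b's between a and c
  starPattern.test("abbbc"), // several b's between a and c
  starPattern.test("axc"),   // x is not a b, so * does not help here
];

// ? means "zero or one of the preceding item".
const optionalPattern = /colou?r/;
const optionalResults = [
  optionalPattern.test("color"),
  optionalPattern.test("colour"),
  optionalPattern.test("colouur"), // two u's is one too many
];

// Placed after another quantifier, ? makes that quantifier lazy,
// which is the role it plays in the question's (.*?).
const greedyMatch = "<a><b>".match(/<.*>/)[0];
const lazyMatch = "<a><b>".match(/<.*?>/)[0];

console.log(starResults, optionalResults, greedyMatch, lazyMatch);
```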
