3

I am trying to match all links of the form

<a href="mailto:abc@abc.com">bla bla bla</a>

and I have another filter that will add a `rel` attribute to them, producing

<a rel="email" href="mailto:abc@abc.com">bla bla bla</a>

So I am looking for a regular expression that will find these links, for use with the replace function.

Tom Zych
Taha
  • 5
    What language are you using and what flavour of regex does it come with? – Andy E Jan 17 '11 at 23:19
  • 1
    No. HTML is not a regular language, so regular expressions are not the tool to use. You should use a parser instead. A streaming parser (e.g. SAX) will solve this problem with maximum efficiency. – OrangeDog Jan 17 '11 at 23:23
  • @OrangeDog: PCRE regexes do not require a language to be regular in order to do some fairly complex things with it. The comment only applies if you are trying to parse nested constructs in general. Something simple like this should not be a particularly tall order. – Orbling Jan 17 '11 at 23:26
  • 2
    In your case, it will probably be enough to replace `$2` where you add the `rel` attribute in the "replace with" field, and consult your program manual on what placeholder to use instead of `$n` (look for "capture", "group" or "label", that's what these things are called …) – Felix Dombek Jan 17 '11 at 23:29
  • @Orbling What about when you get something like `<a href="mailto:a\"bc@abc.com">bla bla bla</a>`? – moinudin Jan 17 '11 at 23:30
  • 1
    @OrangeDog: Orbling is completely right, OP didn't say anything about parsing. S/he just wants to manipulate strings. Any modern flavour of regexes allows exactly what s/he wants. – Felix Dombek Jan 17 '11 at 23:34
  • 1
    @marcog: c'mon, how many email addresses with `"` in them have you seen? But anyway, my idea would still work with that -- `$1 == mailto:a\, $2 == bc@abc.com">bla bla bla` – Felix Dombek Jan 17 '11 at 23:40
  • @marcog: I don't believe speech marks " are valid in email addresses. But even if they were, you can tell it to match only a " without an escape. In this example that is not necessary anyhow. – Orbling Jan 17 '11 at 23:41
  • @Felix It's still valid html. There are far more reasons though: What if `rel` and `href` are the other way around? Additional attributes. Single quotes or no quotes? The `` tag quoted? Lots of things can go wrong when parsing html with a regex. – moinudin Jan 17 '11 at 23:45
  • @marcog, I've just checked the spec and only `'` is allowed within email addresses, not `"`, except in an extremely rare square-bracketed unicode form which is deprecated in the standard. However, on the topic: If OP knows what s/he has written, then it's no problem to find a regex which handles exactly that. Also, modern flavours of regexes are strictly more powerful than regular languages. I'm doing this stuff with regexes all the time and it is usually the easiest thing – Felix Dombek Jan 17 '11 at 23:55
  • 1
    @Felix - If the OP had written the HTML to start with then they would (hopefully) just use Find/Replace in their IDE. One assumes that they are actually processing 3rd-party HTML, which could be of any form. If you care to post a regex you would suggest, I could find at least two valid cases that it would not work for. – OrangeDog Jan 18 '11 at 00:21
  • Fair enough. Regex for Microsoft Expression Web: search field `([^<]*)` and replace field `\2` and I'm aware that no ` " ` s are allowed in the email address and no other tags inside the link, if you just want to prohibit other `a` tags then it is considerably more difficult but I could do it (regular languages are closed under complement, therefore, it is possible). It is probably much less of a hassle than to learn a completely new API and write a whole executable program for it. – Felix Dombek Jan 18 '11 at 00:41
  • @OrangeDog: Not even POSIX-standard regexes are ʀᴇɢᴜʟᴀʀ you know. So what? And plenty of folks don’t write HTML using IDE video games, either. – tchrist Jan 18 '11 at 01:02
  • @tchrist - Yes I know that, but they still can't parse HTML. Also, unless you're still programming on punch cards, you're going to have access to a Find/Replace function. Even vi has one. – OrangeDog Jan 18 '11 at 09:46
  • @Felix - `bla bla bla` and `bla bla bla`. I thought you could have made it at least a little difficult to find them. – OrangeDog Jan 18 '11 at 09:49
  • @Felix - Another highly likely one: ` – OrangeDog Jan 18 '11 at 09:50
  • @OrangeDog Don’t say “can’t”; say “seldom should”. Sometimes they’re ok, but most people don’t think about [all the contingencies](http://stackoverflow.com/questions/4261209/turning-a-input-type-radio-into-a-button-with-regex-c/4261912#4261912), so getting it right is [remarkably difficult in the general case](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326). – tchrist Jan 18 '11 at 13:30
  • @OrangeDog: Well, yes; even vi has a search and replace function. I even use it from time to time. I prefer the versions that allow at least EREs w/o all the backslashes, and like those that allow Perl REs even better. But any kind of `s/pattern/replacement/` simplicity applied to HTML is fraught with peril. Compare the naïve approach with the more general one in [this answer](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326). The 1st is as far as I’d use an editor for, but the 2nd is needed to handle your examples correctly. – tchrist Jan 18 '11 at 14:16
  • @tchrist - There is no way to correctly handle matched token pairs in standard RE implementations: hence "can't". Someone once showed me an RE with recursion, but I don't know of any engines that support it, and it doesn't sound like a good idea. – OrangeDog Jan 18 '11 at 19:04
  • @tchrist - Note comment #2. I was always against using a RE. – OrangeDog Jan 18 '11 at 19:05
  • @OrangeDog: There is no such thing as ‘a standard RE implementation’, you know. Any PCRE-based regex engine will not be troubled by parsing out nested data structures, as plainly demonstrated [here](http://stackoverflow.com/questions/4031112/regular-expression-matching/4034386#4034386) and [here](http://stackoverflow.com/questions/3903965/regex-required-it-should-match-for-following-patterns/3910923#3910923). That said, the best use of regexes is not as a full parser but to grab individual pieces to later assemble using a parser. That is, use it for lexing not parsing. – tchrist Jan 18 '11 at 19:43
  • @tchrist - Oh. Last time I was attempting recursive patterns with PCRE it complained on unknown syntax. And you don't have to keep telling me not to use them to parse html. – OrangeDog Jan 18 '11 at 20:46
  • @OrangeDog: Yeah, I know. Somebody just downvoted me again for my saying not to use regexes for HTML, but then again neglected to leave a comment about why they think I'm wrong and that it must be a good idea. Very annoying. – tchrist Jan 18 '11 at 20:51
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Mark Elliot Jan 23 '11 at 04:12

3 Answers

3

Please use an HTML parser instead. You haven't specified a language, but here's a demonstration using BeautifulSoup in Python:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<a href="mailto:abc@abc.com">bla bla bla</a>')
>>> for a in soup.findAll('a'):
...     a['rel'] = 'email'
... 
>>> soup.prettify()
'<a href="mailto:abc@abc.com" rel="email">\n bla bla bla\n</a>'
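
If installing a third-party library is not an option, a similar effect can be sketched with Python's standard-library `html.parser`. This is only a minimal sketch and not the BeautifulSoup approach shown above: the names `MailtoRelAdder` and `add_email_rel_parsed` are made up for illustration, it only rewrites `<a>` start tags whose `href` begins with `mailto:` and which have no `rel` yet, and the rewritten tags come out with lowercased attribute names and double-quoted values.

```python
from html import escape
from html.parser import HTMLParser


class MailtoRelAdder(HTMLParser):
    """Re-emit the input, adding rel="email" to mailto: links (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as they were written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        href = attr_map.get("href") or ""
        if tag != "a" or not href.lower().startswith("mailto:") or "rel" in attr_map:
            # Not a mailto link (or rel already set): copy the tag verbatim.
            self.parts.append(self.get_starttag_text())
            return
        # Rebuild the tag; the parser unescaped the values, so escape them again.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.parts.append(f'<a rel="email"{rendered}>')

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are passed through untouched.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def add_email_rel_parsed(html_text):
    parser = MailtoRelAdder()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(add_email_rel_parsed('<a href="mailto:abc@abc.com">bla bla bla</a>'))
```

Unlike a fixed pattern, this also picks up links that use single quotes or carry extra attributes, while leaving non-mailto links as they were.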
moinudin
  • 1
    Since beautifulsoup is no longer in development, you might consider lxml (http://codespeak.net/lxml/lxmlhtml.html) instead. – Seth Johnson Jan 17 '11 at 23:32
  • 1
    This is totally irrelevant to OP's problem. As I understand the question, s/he wants to replace strings in HTML documents with other similar strings. That's a task for the search&replace function of his/her editor. – Felix Dombek Jan 17 '11 at 23:43
  • OP hasn't listed the language yet, from the question, it would most likely be JS. – Orbling Jan 17 '11 at 23:43
  • 1
    How can you see that? I find the question totally vague – Felix Dombek Jan 17 '11 at 23:49
  • @Felix If that ends up being the case, then this is such a terrible question for not mentioning the IDE. :) – moinudin Jan 17 '11 at 23:51
  • Well, yes, one thing is certain, OP will get no helpful answer without giving more information (if not, by chance, one of you happened to be right -- but I doubt it.) >:-> – Felix Dombek Jan 17 '11 at 23:58
  • OK, then this example is not so bad after all, but I posted an easier answer which fits your question if you know exactly what kind of format you're dealing with. – Felix Dombek Jan 18 '11 at 01:10
  • 1
    @Taha, I added that to your question. – Dour High Arch Jan 18 '11 at 01:12
  • @Dour ... I am still evaluating this – Taha Jan 18 '11 at 10:04
  • 1
    @Taha, please use an HTML parser as @marcog suggests; HTML is not a regular language and cannot be parsed by a regular expression. You can create individual expressions that parse individual examples, but this can never work in the general case. Python, C#, VB.Net all come with HTML parsers. Use them. – Dour High Arch Jan 18 '11 at 18:11
0

You may want to have a look here: http://reflexxion.de/2010/11/e-mail-adresse-gueltig/

/^([a-zA-Z0-9\.\_\-]+)@([a-zA-Z0-9\.\-]+\.[A-Za-z]{2,4})$/
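
To see what this pattern does and does not accept, here is a small Python sketch. The pattern is copied as is, minus the `/` delimiters; the helper name `looks_like_email` and the sample addresses are made up for illustration. Because the pattern is anchored with `^` and `$`, it checks a whole string as an address and will not find a `mailto:` link inside HTML.

```python
import re

# The pattern from above, without the surrounding / delimiters.
EMAIL_PATTERN = re.compile(r"^([a-zA-Z0-9\.\_\-]+)@([a-zA-Z0-9\.\-]+\.[A-Za-z]{2,4})$")


def looks_like_email(text):
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return EMAIL_PATTERN.match(text) is not None


samples = [
    "abc@abc.com",                                   # plain address
    "user+tag@example.com",                          # '+' is not in the local-part class
    "user@example.museum",                           # top-level domain longer than 4 letters
    '<a href="mailto:abc@abc.com">bla bla bla</a>',  # the link from the question
]
for sample in samples:
    print(sample, looks_like_email(sample))
```

Only the first sample matches; the other three are rejected.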
CoolBeans
Ronald
  • 2
    This is a very naive email address matcher and does not appear to accomplish what Taha is looking for. – Steven Jan 17 '11 at 23:29
0

Look here: http://msdn.microsoft.com/en-us/library/ms972966.aspx#regexnet_topic13. Then just do:

input = Regex.Replace(input, "<a href=\"mailto:(?<mailaddress>[^\"]*)\">(?<linktext>[^<]*)</a>", "<a rel=\"email\" href=\"mailto:${mailaddress}\">${linktext}</a>"); 

or something along these lines ...
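
For comparison, roughly the same replacement can be sketched with Python's `re` module, using the same named groups. The function name `add_email_rel_regex` is made up for illustration, and the sketch assumes the links look exactly like the one in the question: double quotes, `href` as the only attribute, and no nested tags in the link text.

```python
import re

# Same pattern as the C# call above; Python spells named groups (?P<name>...).
MAILTO_LINK = re.compile(
    r'<a href="mailto:(?P<mailaddress>[^"]*)">(?P<linktext>[^<]*)</a>'
)


def add_email_rel_regex(html_text):
    """Insert rel="email" into links that match the exact expected format."""
    return MAILTO_LINK.sub(
        r'<a rel="email" href="mailto:\g<mailaddress>">\g<linktext></a>',
        html_text,
    )


print(add_email_rel_regex('<a href="mailto:abc@abc.com">bla bla bla</a>'))
```

Links written with single quotes, extra attributes, or inner tags are left unchanged by this pattern, which is the limitation raised in the comments on the question.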

Felix Dombek