Regular expression to match
Asked Jan 17 '11 at 23:17

Active Sep 27 '11 at 04:04

Viewed 2,548 times

Question

I am trying to match all

<a href="mailto:abc@abc.com">bla bla bla</a>

and I have another filter that will append

<a rel="email" href="mailto:abc@abc.com">bla bla bla</a>

So I am looking for the regular expression that will find that with the replace function.

What language are you using and what flavour of regex does it come with? — Andy E, Jan 17 '11 at 23:19
No. HTML is not a regular language, so regular expressions are not the tool to use. You should use a parser instead. A streaming parser (e.g. SAX) will solve this problem with maximum efficiency. — OrangeDog, Jan 17 '11 at 23:23
@OrangeDog: PCRE regexp do not require a language to be regular in order to do some fairly complex stuff with. The comment only applies if you are trying to parse some nested construct generally. Something simple like this should not be a particularly tall order. — Orbling, Jan 17 '11 at 23:26
In your case, it will probably be enough to replace `$2` where you add the `rel` attribute in the "replace with" field, and consult your program manual on what placeholder to use instead of `$n` (look for "capture", "group" or "label", that's what these things are called …) — Felix Dombek, Jan 17 '11 at 23:29
@Orbling What when you get something like `bc@abc.com">bla bla bla` ? — moinudin, Jan 17 '11 at 23:30
@OrangeDog: Orbling is completely right, OP didn't say anything about parsing. S/he just wants to manipulate strings. Any modern flavour of regexes allows exactly what s/he wants. — Felix Dombek, Jan 17 '11 at 23:34
@marcog: c'mon, how many email addresses with `"` in them have you seen? But anyway, my idea would still work with that -- `$1 == mailto:a\, $2 == bc@abc.com">bla bla bla` — Felix Dombek, Jan 17 '11 at 23:40
@marcog: I don't believe speech marks " are valid in email addresses. But even if they were, you can tell it to match only a " without an escape. In this example that is not necessary anyhow. — Orbling, Jan 17 '11 at 23:41
@Felix It's still valid html. There are far more reasons though: What if `rel` and `href` are the other way around? Additional attributes. Single quotes or no quotes? The `` tag quoted? Lots of things can go wrong when parsing html with a regex. — moinudin, Jan 17 '11 at 23:45
@marcog, I've just checked the spec and only `'` is allowed within email addresses, not `"`, except in an extremely rare square-bracketed unicode form which is deprecated in the standard. However, on the topic: If OP knows what s/he has written, then it's no problem to find a regex which handles exactly that. Also, modern flavours of regexes are strictly more powerful than regular languages. I'm doing this stuff with regexes all the time and it is usually the easiest thing — Felix Dombek, Jan 17 '11 at 23:55
@Felix - If the OP had written the HTML to start with then they would (hopefully) just use Find/Replace in their IDE. One assumes that they are actually processing 3rd-party HTML, which could be of any form. If you care to post a regex you would suggest, I could find at least two valid cases that it would not work for. — OrangeDog, Jan 18 '11 at 00:21
Fair enough. Regex for Microsoft Expression Web: search field `([^<]*)` and replace field `\2` and I'm aware that no ` " ` s are allowed in the email address and no other tags inside the link, if you just want to prohibit other `a` tags then it is considerably more difficult but I could do it (regular languages are closed under complement, therefore, it is possible). It is probably much less of a hassle than to learn a completely new API and write a whole executable program for it. — Felix Dombek, Jan 18 '11 at 00:41
@OrangeDog: Not even POSIX-standard regexes are ʀᴇɢᴜʟᴀʀ you know. So what? And plenty of folks don’t write HTML using IDE video games, either. — tchrist, Jan 18 '11 at 01:02
@tchrist - Yes I know that, but they still can't parse HTML. Also, unless you're still programming on punch cards, you're going to have access to a Find/Replace function. Even vi has one. — OrangeDog, Jan 18 '11 at 09:46
@Felix - `bla bla bla` and `bla bla bla`. I thought you could have made it at least a little difficult to find them. — OrangeDog, Jan 18 '11 at 09:49
@OrangeDog Don’t say “can’t”; say “seldom should”. Sometimes they’re ok, but most people don’t think about [all the contingencies](http://stackoverflow.com/questions/4261209/turning-a-input-type-radio-into-a-button-with-regex-c/4261912#4261912), so getting it right is [remarkably difficult in the general case](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326). — tchrist, Jan 18 '11 at 13:30
@OrangeDog: Well, yes; even vi has a search and replace function. I even use it from time to time. I prefer the versions that allow at least EREs w/o all the backslashes, and like those that allow Perl REs even better. But any kind of `s/pattern/replacement/` simplicity applied to HTML is fraught with peril. Compare the naïve approach with the more general one in [this answer](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326). The 1st is as far as I’d use an editor for, but the 2nd is needed to handle your examples correctly. — tchrist, Jan 18 '11 at 14:16
@tchrist - There is no way to correctly handle matched token pairs in standard RE implementations: hence "can't". Someone once showed me an RE with recursion, but I don't know of any engines that support it, and it doesn't sound like a good idea. — OrangeDog, Jan 18 '11 at 19:04
@tchrist - Note comment #2. I was always against using a RE. — OrangeDog, Jan 18 '11 at 19:05
@OrangeDog: There is no such thing as ‘a standard RE implementation’, you know. Any PCRE-based regex engine will not be troubled by parsing out nested data structures, as plainly demonstrated [here](http://stackoverflow.com/questions/4031112/regular-expression-matching/4034386#4034386) and [here](http://stackoverflow.com/questions/3903965/regex-required-it-should-match-for-following-patterns/3910923#3910923). That said, the best use of regexes is not as a full parser but to grab individual pieces to later assemble using a parser. That is, use it for lexing not parsing. — tchrist, Jan 18 '11 at 19:43
@tchrist - Oh. Last time I was attempting recursive patterns with PCRE it complained on unknown syntax. And you don't have to keep telling me not to use them to parse html. — OrangeDog, Jan 18 '11 at 20:46
@OrangeDog: Yeah, I know. Somebody just downvoted me again for my saying not to use regexes for HTML, but then again neglected to leave a comment about why they think I'm wrong and that it must be a good idea. Very annoying. — tchrist, Jan 18 '11 at 20:51
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Mark Elliot, Jan 23 '11 at 04:12

score 3 · Answer 1 · answered Jan 17 '11 at 23:28

Please use an HTML parser instead. You haven't specified a language, but here's a demonstration using BeautifulSoup in Python:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<a href="mailto:abc@abc.com">bla bla bla</a>')
>>> for a in soup.findAll('a'):
...     a['rel'] = 'email'
... 
>>> soup.prettify()
'<a href="mailto:abc@abc.com" rel="email">\n bla bla bla\n</a>'
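
For readers who would rather avoid a third-party dependency, the same idea can be sketched with the streaming `html.parser` module from the Python 3 standard library. This is only a minimal sketch under some assumptions: the names `MailtoRelAdder` and `add_email_rel` are made up for illustration, links that already carry a `rel` attribute are left alone, and end tags come out lowercased because the parser normalizes tag names.

```python
from html import escape
from html.parser import HTMLParser


class MailtoRelAdder(HTMLParser):
    """Re-emit HTML as parsed, adding rel="email" to mailto: links (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as separate events
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs):
        names = [name for name, _ in attrs]
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.lower().startswith("mailto:") and "rel" not in names:
            # Rebuild the tag; attribute values arrive unescaped, so re-escape them
            parts = ["a", 'rel="email"']
            for name, value in attrs:
                parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
            return "<" + " ".join(parts) + ">"
        # Every other start tag is passed through exactly as written
        return self.get_starttag_text()

    def handle_starttag(self, tag, attrs):
        self.out.append(self._start(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def add_email_rel(html_text):
    parser = MailtoRelAdder()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

Because the parser hands over the attributes already split and unquoted, single quotes, extra whitespace and additional attributes need no special casing here.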

answered Jan 17 '11 at 23:28

moinudin

Since BeautifulSoup is no longer in development, you might consider lxml (http://codespeak.net/lxml/lxmlhtml.html) instead. – Seth Johnson Jan 17 '11 at 23:32
This is totally irrelevant to OP's problem. As I understand the question, s/he wants to replace strings in HTML documents with other similar strings. That's a task for the search&replace function of his/her editor. – Felix Dombek Jan 17 '11 at 23:43
OP hasn't listed the language yet, from the question, it would most likely be JS. – Orbling Jan 17 '11 at 23:43
How can you see that? I find the question totally vague – Felix Dombek Jan 17 '11 at 23:49
@Felix If that ends up being the case, then this is such a terrible question for not mentioning the IDE. :) – moinudin Jan 17 '11 at 23:51
Well, yes, one thing is certain, OP will get no helpful answer without giving more information (if not, by chance, one of you happened to be right -- but I doubt it.) >:-> – Felix Dombek Jan 17 '11 at 23:58
OK, then this example is not so bad after all, but I posted an easier answer which fits your question if you know exactly what kind of format you're dealing with. – Felix Dombek Jan 18 '11 at 01:10
@Taha, I added that to your question. – Dour High Arch Jan 18 '11 at 01:12
@Dour ... I am still evaluating this – Taha Jan 18 '11 at 10:04
@Taha, please use an HTML parser as @marcog suggests; HTML is not a regular language and cannot be parsed by a regular expression. You can create individual expressions that parse individual examples, but this can never work in the general case. Python, C#, VB.Net all come with HTML parsers. Use them. – Dour High Arch Jan 18 '11 at 18:11

score 0 · Answer 2 · edited Sep 27 '11 at 04:04

You may have a look here: http://reflexxion.de/2010/11/e-mail-adresse-gueltig/

/^([a-zA-Z0-9\.\_\-]+)@([a-zA-Z0-9\.\-]+\.[A-Za-z]{2,4})$/
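
To see what this pattern does and does not cover, here is a small Python check. It is a sketch rather than part of the original answer: the `/.../` delimiters and the redundant backslashes inside the character classes are dropped (the character sets are unchanged), and the helper name `is_plain_address` is made up for illustration. Because the pattern is anchored with `^` and `$`, it only validates a bare address and never matches a whole `<a>` element.

```python
import re

# Same character classes as the pattern above, without the /.../ delimiters
EMAIL_RE = re.compile(r"^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+\.[A-Za-z]{2,4})$")


def is_plain_address(text):
    """Return True if the whole string looks like a simple email address."""
    return EMAIL_RE.match(text) is not None


print(is_plain_address("abc@abc.com"))
print(is_plain_address('<a href="mailto:abc@abc.com">bla bla bla</a>'))
```

The `{2,4}` limit on the last label also means that addresses under longer top-level domains are rejected.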

edited Sep 27 '11 at 04:04

CoolBeans

answered Jan 17 '11 at 23:26

Ronald

This is a very naive email address matcher and does not appear to accomplish what Taha is looking for. – Steven Jan 17 '11 at 23:29

score 0 · Answer 3 · answered Jan 18 '11 at 01:06

Look here: http://msdn.microsoft.com/en-us/library/ms972966.aspx#regexnet_topic13 .. so just do

input = Regex.Replace(input, "<a href=\"mailto:(?<mailaddress>[^\"]*)\">(?<linktext>[^<]*)</a>", "<a rel=\"email\" href=\"mailto:${mailaddress}\">${linktext}</a>");

or something along these lines ...
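
For comparison, a rough Python equivalent of the call above might look like the sketch below. It assumes the input has exactly the shape shown in the question (double-quoted `href` as the only attribute, no nested tags in the link text), it already uses `\s+` instead of a single space, and the names `MAILTO_LINK` and `add_rel` are invented for this example. Python spells named groups `(?P<name>...)` and refers to them as `\g<name>` in the replacement.

```python
import re

# Only matches <a href="mailto:..."> with a double-quoted href and plain link text
MAILTO_LINK = re.compile(
    r'<a\s+href="mailto:(?P<mailaddress>[^"]*)">(?P<linktext>[^<]*)</a>'
)


def add_rel(html_text):
    """Insert rel="email" into every link that matches the narrow pattern above."""
    return MAILTO_LINK.sub(
        r'<a rel="email" href="mailto:\g<mailaddress>">\g<linktext></a>',
        html_text,
    )


print(add_rel('<a href="mailto:abc@abc.com">bla bla bla</a>'))
```

Anything that deviates from the assumed shape, such as single quotes or markup inside the link text, is silently left untouched.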

answered Jan 18 '11 at 01:06

Felix Dombek

That only works if there is exactly one space between `a` and `href`, and HTML allows any number of different things. – Dour High Arch Jan 18 '11 at 01:20
Then use `(\s+)` in the place of spaces – Felix Dombek Jan 18 '11 at 01:24
` – Dour High Arch Jan 18 '11 at 01:45
But you're changing a question about a tiny subset of all the possibilities into the complete parsing task. That was not the question. Still then, where does `href="mailto:___"` occur if not in links? If it is known that no link already has the `rel` attribute, then it would be enough to replace `href="mailto:` with `rel="email" href="mailto:`. Parsing a tree (or even just SAX / any *structural* analysis) for such a little task is pure overkill. But obviously there are proponents of both ways here. I'd stick to the smallest solution that does the task as needed. – Felix Dombek Jan 18 '11 at 01:51
@Felix If we were to interpret the question literally, then `s/bla bla bla<\/a>/bla bla bla<\/a>/` would be the right answer. You can't argue what was the question when it's ill-defined. You were already wrong about this being an editor question. – moinudin Jan 18 '11 at 02:30
What I wrote here applies nevertheless -- it needs a few previous assumptions about the code, but if one can be sure that they are justified, one can solve this much easier than parsing -- essentially, a one-liner. And I'm not even talking about the overhead in memory usage that parsing incurs. I realize your answer is more general, and it's probably a real good way of parsing HTML, but my answer isn't wrong at all for the problem as it's stated. – Felix Dombek Jan 18 '11 at 17:10
@Felix - There is no greater memory usage than a RE if you use a streaming parser. – OrangeDog Jan 18 '11 at 19:06
OK, but BeautifulSoup is not a streaming parser. Streaming parsing also has an overhead in programming time compared to one RE check. But whatever, my point wasn't the overhead – Felix Dombek Jan 18 '11 at 19:31
@Felix - I could learn the SAX api and implement this faster than I could work out a regular expression that covers even half of the common cases. – OrangeDog Jan 18 '11 at 20:42
