2

I need to strip this string `<a class=BC_ANCHOR href="http://www.msn.com" onClick=something target=_blank>MSN</a>` down to `<a href="http://www.msn.com">MSN</a>`. However, the regex `\s+\w+[^href]=\S*\w?` doesn't stop at the closing `>` of the opening tag; it runs on to the end of the `</a>`. Can someone please help me get this regex to stop at that closing `>`?

Thanks!

Mike Perrenoud
  • That regex looks wrong in lots of ways, e.g., `[^href]` means "match a _single_ character that is anything other than an h, r, e or f". What is the context where that code will run? (Because if you're extracting an element that is already on the page, there are much easier ways to go about it.) – nnnnnn Mar 02 '12 at 02:04
  • `[^href]` means any character except `h`, `r`, `e`, or `f`. It doesn't mean not `href`. That would be something like `((?!href\b)[a-z]+)` – Mike Samuel Mar 02 '12 at 02:07
  • You might need [fancier patterns than that](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491). – tchrist Mar 02 '12 at 02:12
  • [You can't parse HTML with Regular Expressions.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – ghoti Mar 02 '12 at 02:15
  • @ghoti [Nonsense! Of course you can!](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string) You just don’t want to, for general stuff. But for simple stuff like this, of course you want to use a regex. It’s what they’re made for. Stop being an unthinking parrot. Just because you may not be able to figure it out doesn’t mean you should insult others’ intelligence by pretending they can’t figure it out either. – tchrist Mar 02 '12 at 02:26
  • @tchrist, thanks for your pointed reminder that not everyone has a sense of humour. Did you even bother to check where the link led? – ghoti Mar 02 '12 at 02:53
  • I use regexps to parse HTML, but I still thought the link was funny. Does that make me a bad person? – Graham Mar 02 '12 at 02:57
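
To make the character-class point in the comments above concrete, here is a minimal sketch contrasting `[^href]` with a negative lookahead. The helper names `endsOutsideSet` and `isNotHref` and the sample attribute names are made up for illustration; they do not come from the thread.

```javascript
// [^href] is a character class: it matches ONE character that is not h, r, e, or f.
// Anchored like this, it in effect tests only the last character of the name.
const endsOutsideSet = (name) => /^\w+[^href]$/.test(name);

// (?!href\b) is a negative lookahead: it rejects the whole word "href" and nothing else.
const isNotHref = (name) => /^(?!href\b)[a-z]+$/i.test(name);

for (const name of ['href', 'hreflang', 'class', 'title']) {
  console.log(name, endsOutsideSet(name), isNotHref(name));
}
```

Here `title` fails the first test only because it ends in `e`, while the lookahead version accepts every name except `href` itself.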

4 Answers

3

With \w+[^href] you still match things like <a href ="... (note the space before the =), and you can miss attributes whose names end in h, r, e, or f (which aren't necessarily href).

Try

\s+(?!href)[a-zA-Z]+ *= *(?:"[^"]+"|\w+)

Explanation: The (?!href) is a negative lookahead and prevents the attribute from being href.

The [a-zA-Z]+ is your attribute name. Spaces are allowed before and after the '='. I restricted it to letters to keep things simple; real attribute names can also contain digits and hyphens (e.g. data-id), so widen the class if you need to match those.

The (?:"[^"]+"|\w+) means that the value of the attribute can be anything within double quotes, OR an unquoted run of \w+.

These all prevent the match from going past the >, unless your HTML is malformed and you have (e.g.) <a name="asdf> (note the missing closing ").
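
As a rough sketch of how this pattern could be applied, the snippet below runs it against the sample string from the question with a global replace. The variable names are illustrative, and the character class is written as `[a-zA-Z]+`, as described in the explanation.

```javascript
// Sample anchor taken from the question.
const input = '<a class=BC_ANCHOR href="http://www.msn.com" onClick=something target=_blank>MSN</a>';

// Whitespace, an attribute name that does not start with "href",
// then "=" and either a double-quoted value or a bare run of word characters.
const nonHrefAttr = /\s+(?!href)[a-zA-Z]+ *= *(?:"[^"]+"|\w+)/g;

// Remove every matching attribute; the href attribute is skipped by the lookahead.
const stripped = input.replace(nonHrefAttr, '');
console.log(stripped); // <a href="http://www.msn.com">MSN</a>
```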

mathematical.coffee
  • This is working awesome - except for one scenario that I just found. There is one link that looks like this `MSN` and for some reason `(event)` isn't getting matched. I've tried changing the `\w+` to a `.*` but that selects everything then. Regex, why do you hate me? – Mike Perrenoud Mar 02 '12 at 02:20
  • Regex doesn't hate you, you just have to learn about greedy and non-greedy matching. `.*` matches as much as it possibly can (so it will go all the way to the last `>`). To make it non-greedy, i.e. match as *little* as possible, try `.*?`. Or you could just use `[\w()]+` to allow `\w` and parentheses. (Remember that `\w` is roughly `[a-zA-Z0-9_]`; I'm unsure about locale and accented letters.) – mathematical.coffee Mar 02 '12 at 02:41
  • @mathematical.coffee that worked awesome by putting the () in there - it does exactly what I need it to do now, thanks a lot!! – Mike Perrenoud Mar 02 '12 at 18:37
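
The greedy versus non-greedy distinction and the `[\w()]+` fix from the comments above can be sketched as follows. The input string is hypothetical, since the actual link from the comment is not shown; only the `(event)` fragment is taken from the thread.

```javascript
// Hypothetical input: made up to contain an unquoted value with "(event)".
const sample = '<a href="http://www.msn.com" onClick=doIt(event)>MSN</a>';

// Greedy vs. non-greedy on a plain tag match.
const greedy = sample.match(/<.*>/)[0];  // runs to the last ">"
const lazy = sample.match(/<.*?>/)[0];   // stops at the first ">"

// \w+ stops at "(", so "(event)" is left behind; [\w()]+ also consumes the parentheses.
const narrow = sample.replace(/\s+(?!href)[a-zA-Z]+ *= *(?:"[^"]+"|\w+)/g, '');
const wide = sample.replace(/\s+(?!href)[a-zA-Z]+ *= *(?:"[^"]+"|[\w()]+)/g, '');

console.log(greedy);
console.log(lazy);
console.log(narrow);
console.log(wide);
```

Widening the character class keeps the match bounded by what an unquoted value may contain, which is why it stops at the `>` without needing a non-greedy quantifier.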
3

Don't try to sanitize HTML using regular expressions. You're more likely than not to get it wrong in ways that have poor security consequences.

There may be DOM solutions to your problem and if not, there are libraries that have been thoroughly tested and reviewed by people who write parsers for a living.

Shameless plug: http://code.google.com/p/google-caja/wiki/JsHtmlSanitizer

Mike Samuel
  • +1 Apparently the down voter didn't have sufficient reasons to state them. The answer is good advice. – RobG Mar 02 '12 at 02:16
  • Why would you want to sanitize HTML with JS? Why wouldn't you do that server-side if it was needed? – mpen Mar 02 '12 at 03:11
  • @Mark, if you get HTML from a webservice call, but don't trust the service to run code in your domain, then you have to sanitize it yourself. You can avoid latency by doing it in the client. – Mike Samuel Mar 02 '12 at 04:39
2

If you really want to use a regex my suggestion is to do it the other way around. Extract the href and the link text to groups and then generate the tag again.

href="([^"]+)"[^>]*>([^<]+)<\/a>

Someone mentioned getting the values using the DOM; I agree that is the best option if you are using JS.
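
A minimal sketch of this capture-and-rebuild idea, using the sample string from the question. The `<a\b[^>]*?` prefix is an addition of mine so that the replacement covers the whole opening tag; it is an assumption, not part of the pattern above.

```javascript
// Sample anchor taken from the question.
const original = '<a class=BC_ANCHOR href="http://www.msn.com" onClick=something target=_blank>MSN</a>';

// Group 1 captures the href value, group 2 the link text.
// The leading "<a\b[^>]*?" is an assumed extension so the whole tag is replaced.
const anchorPattern = /<a\b[^>]*?href="([^"]+)"[^>]*>([^<]+)<\/a>/g;

// Regenerate a clean tag from the two captured groups.
const rebuilt = original.replace(anchorPattern, '<a href="$1">$2</a>');
console.log(rebuilt); // <a href="http://www.msn.com">MSN</a>
```

The pattern is expected to match the whole anchor; the useful parts are the two groups, which are then written back into a fresh tag.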

Bruno Silva
  • That Regex selected the entire string for some reason - when what I need to do is strip everything out of the `A` tag except for the `href` and render that string. – Mike Perrenoud Mar 02 '12 at 02:08
0

Are you dealing with HTML or DOM elements?

Much easier to deal with elements. If you want the element to have only an href attribute, then why not something like:

function fixLink(el) {
  var newLink = document.createElement('a');
  newLink.href = el.href;
  newLink.appendChild(document.createTextNode(el.textContent || el.innerText));
  el.parentNode.replaceChild(newLink, el);
}

Even if you're dealing with HTML, you can insert it into a new element (say a div), do the above, then get the remaining innerHTML.
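
The HTML-string variant described above might look roughly like this. `stripLinkAttributes` is a hypothetical helper name, the code needs a browser `document` to run, and it assumes the HTML is already trusted; for untrusted input a vetted sanitizer is the safer route.

```javascript
// Hypothetical helper: takes an HTML string and returns it with every <a>
// reduced to just its href and text. `doc` defaults to the page's document.
function stripLinkAttributes(html, doc = document) {
  // Assumption: `html` is trusted, since assigning innerHTML parses it as markup.
  const container = doc.createElement('div');
  container.innerHTML = html;

  // querySelectorAll returns a static list, so replacing nodes in the loop is safe.
  for (const el of container.querySelectorAll('a')) {
    const newLink = doc.createElement('a');
    // getAttribute keeps the href as written; el.href would give the resolved absolute URL.
    const href = el.getAttribute('href');
    if (href !== null) {
      newLink.setAttribute('href', href);
    }
    newLink.appendChild(doc.createTextNode(el.textContent));
    el.parentNode.replaceChild(newLink, el);
  }
  return container.innerHTML;
}

// Usage in a browser (not executed here):
// stripLinkAttributes('<a class=BC_ANCHOR href="http://www.msn.com" target=_blank>MSN</a>');
```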

RobG
  • I want to explore this. I am parsing links out of a comments field, and the comments are originally displayed to the user in HTML, but when the user wants to edit that comment I need to convert it to text and strip off some of the adornment I add specific to the application. With that in mind is there a better way to do this with the DOM then? – Mike Perrenoud Mar 02 '12 at 18:41