Javascript Regex - unexpected behaviour on faking lookbehind

Question

I am trying to code a widget that collates Tweets from multiple sources as an exercise (something similar exists here, but a) the list option offered there did not load any of my lists, and b) it is a useful learning exercise!). As part of this, I wanted to write a regex which replaces a Twitter handle ('@' followed by characters) with a link to the user's Twitter page. However, I did not want false positives for, for instance, an email address in a tweet.

So, for instance, the replacement should send

Hey there @twitteruser, my email address is address@gmail.com

to

Hey there <a href="http://twitter.com/twitteruser">@twitteruser</a>, my email address is address@gmail.com

Guided by this question, which suggested that I needed some way of replicating negative look-behinds in Javascript, I wrote the following code:

tweetText = tweetText.replace(/(\S)?@([^\s,.;:]*)/ig, function($0, $1){
    return $1 ? $0 + '@' + $1 : '<a href="http://www.twitter.com/' + $0 + '">@' + $0 + '</a>'
});

However, in the cases where the final part of the ternary operator is triggered, $0 contains the '@' symbol. This was unexpected for me - since the '@' was not enclosed in parentheses, I expected $0 to match '([^\s,.;:]*)' - that is, the username of the Twitter user (after, and without, the '@'). I can get the desired functionality by using $0.substring(1), but I would like to further my understanding.

Could someone please point out what I have misunderstood? I am quite new to regexes, and have never written them in Javascript, nor have I ever used negative look-behinds.

`$0` is always the **whole** pattern. `$1` is the first parenthesized group, `$2` is the second, etc. — Pointy, Jun 30 '12 at 23:36
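To make the argument order concrete, here is a minimal sketch of the question's replacement with the callback parameters named for what they actually receive. The function name `linkHandles` and the parameter names are my own, and leaving the text untouched when a non-space character precedes the '@' is an assumption about the intended behaviour.

```javascript
// Hypothetical helper; the pattern is the one from the question.
function linkHandles(tweetText) {
  return tweetText.replace(/(\S)?@([^\s,.;:]*)/g, function (match, before, handle) {
    // match  : the whole matched text, e.g. "@twitteruser" or "s@gmail"
    // before : group 1, the optional non-space character (undefined if absent)
    // handle : group 2, the text after the "@"
    if (before !== undefined) {
      return match; // something like an email address: leave it unchanged
    }
    return '<a href="http://www.twitter.com/' + handle + '">@' + handle + '</a>';
  });
}
```

Calling `linkHandles` on the sample tweet from the question should link `@twitteruser` and leave `address@gmail.com` alone.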

score 3 · Accepted Answer · answered Jul 01 '12 at 00:02

In any case, instead of trying to match an optional non-space before the @, and rejecting the match if you find one, why not just require a space (or the beginning of the string) before the @?

tweetText = tweetText.replace(
    /(^|\s)@([^\s,.;:]*)/g,
    '$1<a href="http://www.twitter.com/$2">@$2</a>'
);

Not only is this simpler, but it's likely to be quite a bit faster too, since the regexp needs to consider far fewer potential matches.
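As a rough check, the same replacement can be wrapped in a function and tried on a handle at the start of the string, a handle after whitespace, and an email address. The name `linkMentions` is made up for this sketch; the pattern and replacement string are the ones from the answer.

```javascript
// Hypothetical wrapper around the replacement shown in the answer.
function linkMentions(tweetText) {
  return tweetText.replace(
    /(^|\s)@([^\s,.;:]*)/g,
    // $1 puts back the whitespace (or the empty string at the start of the text)
    '$1<a href="http://www.twitter.com/$2">@$2</a>'
  );
}

console.log(linkMentions('@scubbo: thanks @pointy.'));
console.log(linkMentions('my email address is address@gmail.com'));
```

The first call should link both handles and the second should return its input unchanged, because the '@' in the email address is preceded by neither whitespace nor the start of the string.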

Ilmari Karonen

Perfect, thank you - I considered doing this at first, but didn't realise that you could mix special characters (like the caret) in with normal patterns. This works, thank you! – scubbo Jul 01 '12 at 00:12

Mitya · Answer 2 · 2012-07-01T00:04:12.833

As is standard behaviour in most REGEX implementations, match zero is the whole match (including, as part of it, any sub-matches - even any that are marked as non-capturing), then any subsequent matches are the captured sub-matches. Check out www.regular-expressions.info. For example:

console.log('hello, there'.match(/\w+(?:,) ?(\w+)/));

Gives you the array

["hello, there", "there"] //the first sub-match is non-capturing

JavaScript does not support look-behinds but there are simulations for this, none perfect, like the one I wrote. JavaScript's REGEXP implementation in general is weaker than that of some other languages. Some examples of omissions include:

look-behinds
named atomic groups
most of the modifiers (though the key ones are there - global, case-insensitive and multi-line)
crucially, the ability to capture sub-groups whilst also matching globally
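On the last point: `String.prototype.match` with the global flag returns only the whole matches, so the captured sub-groups are not in the result. One common workaround, sketched here with a made-up sample string and pattern, is to call `RegExp.prototype.exec` in a loop, which returns one match at a time together with its groups.

```javascript
var text = 'thanks @scubbo and @pointy';
var pattern = /@(\w+)/g;

// match() with the g flag returns whole matches only; the groups are dropped.
var wholeMatches = text.match(pattern);

// exec() on a global regex resumes from lastIndex, so each call yields the
// next match, including its captured groups.
var captured = [];
var m;
while ((m = pattern.exec(text)) !== null) {
  captured.push(m[1]);
}

console.log(wholeMatches); // ["@scubbo", "@pointy"]
console.log(captured);     // ["scubbo", "pointy"]
```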

Thanks for this - I didn't realise that $0 returned the whole match, that explains a lot! — scubbo, Jul 01 '12 at 00:12

score 2 · Answer 3 · answered Jun 30 '12 at 23:52

I think you might be complicating things too much. Try this to retrieve the usernames and then make your own helper function to create the markup.

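var getTwitter = function (str) {
  // Requires a non-word character before the "@", so a handle at the very
  // start of the string is not matched.
  var re = /[^\w](@\w+)/g,
      match,
      handles = [];
  // With the g flag, each exec() call resumes after the previous match.
  while ((match = re.exec(str)) !== null) {
    handles.push(match[1]); // group 1: the "@" plus the handle
  }
  return handles;
};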

Demo: http://jsfiddle.net/elclanrs/gLvX4/

elclanrs

Thanks for this - the reason that I felt I needed negative lookbehinds was that sometimes (in fact, often) the username tag will occur at the beginning of the text, which this example doesn't catch (though, to be fair, I didn't include that in my example). If there were some way of matching 'a non-word character or the beginning of the string', that would be perfect, but I don't think /[^\w|^]/ will behave as I hope. – scubbo Jul 01 '12 at 00:01

@scubbo: There is: `(^|\W)`. See my answer for a full example. – Ilmari Karonen Jul 01 '12 at 00:05
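For illustration, here is a sketch of how the `(^|\W)` alternation could be dropped into an `exec` loop like the one in this answer, so that a handle at the very start of the text is also collected. The function name `getHandles` is made up, and returning the handles with the leading '@' is an assumption carried over from the original function.

```javascript
// Hypothetical variant of getTwitter that also accepts the start of the string.
function getHandles(str) {
  var re = /(^|\W)@(\w+)/g; // group 1: start or non-word char, group 2: handle
  var handles = [];
  var match;
  while ((match = re.exec(str)) !== null) {
    handles.push('@' + match[2]);
  }
  return handles;
}

console.log(getHandles('@scubbo hello @pointy, mail me at address@gmail.com'));
// ["@scubbo", "@pointy"]
```

The email address is skipped because the character before its '@' is a word character, which matches neither `^` nor `\W`.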

score 0 · Answer 4 · answered Jan 20 '13 at 03:06

You're overcomplicating this; it's not that complicated. You can do everything at once in a single line of code. Just use \W@(\w+)

Live demo http://jsfiddle.net/Victornpb/Wugvd/

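//make twitter username links
// Note: \W consumes the character before the "@" and the replacement puts a
// space in its place; a handle at the very start of the text is not matched.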
function linkTwitterNames(elm){
    elm.innerHTML = elm.innerHTML.replace(/\W@(\w+)/g, ' <a class="twitter" href="http://twitter.com/$1" target="_blank">@$1</a>');
}
