I am trying to code a widget that collates Tweets from multiple sources as an exercise (something similar exists here, but the list option offered there did not load any of my lists, and besides, it is a useful learning exercise!). As part of this, I wanted to write a regex that replaces a Twitter handle ('@' followed by characters) with a link to the user's Twitter page. However, I did not want false positives on, for instance, an email address in a tweet.

So, for instance, the replacement should send

Hey there @twitteruser, my email address is address@gmail.com

to

Hey there <a href="http://twitter.com/twitteruser">@twitteruser</a>, my email address is address@gmail.com

Guided by this question, which suggested that I needed some way of replicating negative look-behinds in JavaScript, I wrote the following code:

tweetText = tweetText.replace(/(\S)?@([^\s,.;:]*)/ig, function($0, $1){
    return $1 ? $0 + '@' + $1 : '<a href="http://www.twitter.com/' + $0 + '">@' + $0 + '</a>'
});

However, in the cases where the final part of the ternary operator is triggered, $0 contains the '@' symbol. This was unexpected for me - since the '@' was not enclosed in parentheses, I expected $0 to match '([^\s,.;:]*)' - that is, the username of the Twitter user (after, and without, the '@'). I can get the desired functionality by using $0.substring(1), but I would like to further my understanding.

Could someone please point out what I have misunderstood? I am quite new to regexes, and have never written them in JavaScript, nor have I ever used negative look-behinds.

scubbo
  • `$0` is always the **whole** pattern. `$1` is the first parenthesized group, `$2` is the second, etc. – Pointy Jun 30 '12 at 23:36

4 Answers

3

In any case, instead of trying to match an optional non-space before the @, and rejecting the match if you find one, why not just require a space (or the beginning of the string) before the @?

tweetText = tweetText.replace(
    /(^|\s)@([^\s,.;:]*)/g,
    '$1<a href="http://www.twitter.com/$2">@$2</a>'
);

Not only is this simpler, but it's likely to be quite a bit faster too, since the regexp needs to consider far fewer potential matches.
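
As a quick check of the pattern above, here is a minimal sketch that runs it on the example tweet from the question. The function name linkHandles is hypothetical and introduced only for illustration.

```javascript
// Hypothetical helper (the name is illustrative, not part of the answer above).
function linkHandles(tweetText) {
    // (^|\s) captures the start of the string or one whitespace character,
    // so an '@' preceded by any other character (as in an email) is skipped.
    return tweetText.replace(
        /(^|\s)@([^\s,.;:]*)/g,
        '$1<a href="http://www.twitter.com/$2">@$2</a>'
    );
}

var sample = 'Hey there @twitteruser, my email address is address@gmail.com';
console.log(linkHandles(sample));
```

Because $1 is written back in front of the link, the whitespace before the handle is preserved and the email address is left untouched.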

Ilmari Karonen
  • Perfect, thank you - I considered doing this at first, but didn't realise that you could mix special characters (like the caret) in with normal patterns. This works, thank you! – scubbo Jul 01 '12 at 00:12
2

As is standard behaviour in most REGEX implementations, match zero is the whole match (including, as part of it, any sub-matches - even any that are marked as non-capturing), then any subsequent matches are the captured sub-matches. Check out www.regular-expressions.info. For example:

console.log('hello, there'.match(/\w+(?:,) ?(\w+)/));

Gives you the array

["hello, there", "there"] //the first sub-match is non-capturing
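
The same numbering carries over to a function passed to replace: its first argument is the whole match, and the captured groups follow in order. A minimal sketch, with parameter names and sample input chosen here purely for illustration:

```javascript
// Parameter names are illustrative; only their positions matter.
var seen = [];
'Hi @alice'.replace(/(\S)?@(\w+)/g, function (whole, before, name) {
    // whole  -> the entire matched text, including the '@'
    // before -> first group; undefined when the optional group did not match
    // name   -> second group, the handle without the '@'
    seen.push([whole, before, name]);
    return whole; // leave the text unchanged
});
console.log(seen);
```

In the question's callback, the parameters named $0 and $1 therefore receive the whole match and the optional (\S) group; the username would be the third argument.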

JavaScript did not support look-behinds at the time of writing (they were added to the language in ECMAScript 2018), but there are simulations for this, none perfect, like the one I wrote. JavaScript's regex implementation has in general been weaker than that of some other languages. Examples of omissions at the time include:

  • look-behinds
  • named atomic groups
  • most of the modifiers (though the key ones are there - global, case-insensitive and multi-line)
  • crucially, the ability to capture sub-groups whilst also matching globally
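
To illustrate the last point: with the g flag, String.prototype.match returns only the whole matches and drops the groups, so the usual workaround is a loop over RegExp.prototype.exec. A minimal sketch; the sample string is made up for illustration:

```javascript
var text = 'cc @alice and @bob'; // made-up sample input
var re = /@(\w+)/g;

// With the g flag, match() returns whole matches only; groups are dropped.
var wholeMatches = text.match(re);

// exec() advances re.lastIndex on each call and keeps the groups.
var names = [];
var m;
while ((m = re.exec(text)) !== null) {
    names.push(m[1]);
}
console.log(wholeMatches, names);
```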
Mitya
2

I think you might be complicating things too much. Try this to retrieve the usernames and then make your own helper function to create the markup.

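var getTwitter = function (str) {
  var re = /[^\w](@\w+)/g, // one non-word character, then the captured @handle
      matches,
      tweets = [];
  // exec() with the g flag resumes from re.lastIndex on every call
  while ((matches = re.exec(str)) !== null) {
    tweets.push(matches[1]);
  }
  return tweets;
};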

Demo: http://jsfiddle.net/elclanrs/gLvX4/
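
Note that the pattern above requires some character before the @, so a handle at the very start of the string is not matched. A possible variant, assuming an alternation with ^ is acceptable here; the name getTwitterAnywhere and the (^|\W) prefix are not part of the original answer:

```javascript
// Hypothetical variant of getTwitter; name and (^|\W) prefix are assumptions.
var getTwitterAnywhere = function (str) {
  var re = /(^|\W)(@\w+)/g,
      match,
      handles = [];
  // Group 1 is the start of the string or one non-word character;
  // group 2 is the handle, including its '@'.
  while ((match = re.exec(str)) !== null) {
    handles.push(match[2]);
  }
  return handles;
};

console.log(getTwitterAnywhere('@alice hi @bob, mail a@b.com'));
```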

elclanrs
  • Thanks for this - the reason that I felt I needed negative lookbehinds was that sometimes (in fact, often) the username tag will occur at the beginning of the text, which this example doesn't catch (though, to be fair, I didn't include that in my example). If there were some way of matching 'a non-word character or the beginning of the string', that would be perfect, but I don't think /[^\w|^]/ will behave as I hope. – scubbo Jul 01 '12 at 00:01
  • 1
    @scubbo: There is: `(^|\W)`. See my answer for a full example. – Ilmari Karonen Jul 01 '12 at 00:05
0

You're overcomplicating this; it's not that complicated. You can do everything at once in a single line of code with the pattern \W@(\w+)

Live demo http://jsfiddle.net/Victornpb/Wugvd/

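// Make Twitter username links.
// Note: \W consumes the character before the '@', which is replaced by a space,
// and a handle at the very start of the text is not matched.
function linkTwitterNames(elm){
    elm.innerHTML = elm.innerHTML.replace(/\W@(\w+)/g, ' <a class="twitter" href="http://twitter.com/$1" target="_blank">@$1</a>');
}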
Vitim.us