I typed this a while back but got busy (still am, so I might take a while to reply back) and didn't get around to post it. If you're still open to answers...
Is there some teaching resource that is promoting them?
I don't think so, it's just a coincidence I believe.
But, for example, this query from a recent question has a simple solution to capturing the .*
, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
This is most probably a C# regex, since variable width lookbehinds are not supported my many regex engines. Well, the lookarounds could be certainly avoided here, because for this, I believe it's really simpler to have capture groups (and make the .*
lazy as we're at it):
(<td><a href="\/xxx\.html\?n=[0-9]{0,5}">).*?(<\/a><span)
If it's for a replace, or
<td><a href="\/xxx\.html\?n=[0-9]{0,5}">(.*?)<\/a><span
for a match. Though an html parser would definitely be more advisable here.
Lookarounds in this case I believe are slower. See regex101 demo where the match is 64 steps for capture groups but 94+19 = 1-3 steps for the lookarounds.
When is it truly better to use a positive look-around? Can you give some examples?
Well, lookarounds have the property of being zero-width assertions, which mean they don't really comtribute to matches while they contribute onto deciding what to match and also allows overlapping matches.
Thinking a bit about it, I think, too, that negative lookarounds get used much more often, but that doesn't make positive lookarounds less useful!
Some 'exploits' I can find browsing some old answers of mine (links below will be demos from regex101) follow. When/If you see something you're not familiar about, I probably won't be explaining it here, since the question's focused on positive lookarounds, but you can always look at the demo links I provided where there's a description of the regex, and if you still want some explanation, let me know and I'll try to explain as much as I can.
To get matches between certain characters:
In some matches, positive lookahead make things easier, where a lookahead could do as well, or when it's not so practical to use no lookarounds:
Dog sighed. "I'm no super dog, nor special dog," said Dog, "I'm an ordinary dog, now leave me alone!" Dog pushed him away and made his way to the other dog.
We want to get all the dog
(regardless of case) outside quotes. With a positive lookahead, we can do this:
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*$)
to ensure that there are even number of quotes ahead. With a negative lookahead, it would look like this:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
to ensure that there are no odd number of quotes ahead. Or use something like this if you don't want a lookahead, but you'll have to extract the group 1 matches:
(?:"[^"]+"[^"]+?)?(\bdog\b)
Okay, now say we want the opposite; find 'dog' inside the quotes. The regex with the lookarounds just need to have the sign inversed, first and second:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*$)
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
But without the lookaheads, it's not possible. the closest you can get is maybe this:
"[^"]*(\bdog\b)[^"]*"
But this doesn't get all the matches, or you can maybe use this:
"[^"]*?(\bdog\b)[^"]*?(?:(\bdog\b)[^"]*?)?"
But it's just not practical for more occurrences of dog
and you get the results in variables with increasing numbers... And this is indeed easier with lookarounds, because they are zero width assertions, you don't have to worry about the expression inside the lookaround to match dog
or not, or the regex wouldn't have obtained all the occurrences of dog
in the quotes.
Of course now, this logic can be extended to groups of characters, such as getting specific patterns between words such as start
and end
.
Overlapping matches
If you have a string like:
abcdefghijkl
And want to extract all the consecutive 3 characters possible inside, you can use this:
(?=(...))
If you have something like:
1A Line1 Detail1 Detail2 Detail3 2A Line2 Detail 3A Line3 Detail Detail
And want to extract these, knowing that each line starts with #A Line#
(where #
is a number):
1A Line1 Detail1 Detail2 Detail3
2A Line2 Detail
3A Line3 Detail Detail
You might try this, which fails because of greediness...
[0-9]+A Line[0-9]+(?: \w+)+
Or this, which when made lazy no more works...
[0-9]+A Line[0-9]+(?: \w+)+?
But with a positive lookahead, you get this:
[0-9]+A Line[0-9]+(?: \w+)+?(?= [0-9]+A Line[0-9]+|$)
And appropriately extracts what's needed.
Another possible situation is one where you have something like this:
#ff00fffirstword#445533secondword##008877thi#rdword#
Which you want to convert to three pairs of variables (first of the pair being a # and some hex values (6) and whatever characters after them):
#ff00ff and firstword
#445533 and secondword#
#008877 and thi#rdword#
If there were no hashes inside the 'words', it would have been enough to use (#[0-9a-f]{6})([^#]+)
, but unfortunately, that's not the case and you have to resort to .*?
instead of [^#]+
, which doesn't quite yet solve the issue of stray hashes. Positive lookaheads however make this possible:
(#[0-9a-f]{6})(.+?)(?=#[0-9a-f]{6}|$)
Validation & Formatting
Not recommended, but you can use positive lookaheads for quick validations. The following regex for instance allow the entry of a string containing at least 1 digit and 1 lowercase letter.
^(?=[^0-9]*[0-9])(?=[^a-z]*[a-z])
This can be useful when you're checking for character length but have patterns of varying length in the a string, for example, a 4 character long string with valid formats where #
indicates a digit and the hyphen/dash/minus -
must be in the middle:
##-#
#-##
A regex like this does the trick:
^(?=.{4}$)\d+-\d+
Where otherwise, you'd do ^(?:[0-9]{2}-[0-9]|[0-9]-[0-9]{2})$
and imagine now that the max length was 15; the number of alterations you'd need.
If you want a quick and dirty way to rearrange some dates in the 'messed up' format mmm-yyyy
and yyyy-mm
to a more uniform format mmm-yyyy
, you can use this:
(?=.*(\b\w{3}\b))(?=.*(\b\d{4}\b)).*
Input:
Oct-2013
2013-Oct
Output:
Oct-2013
Oct-2013
An alternative might be to use a regex (normal match) and process separately all the non-conforming formats separately.
Something else I came across on SO was the indian currency format, which was ##,##,###.###
(3 digits to the left of the decimal and all other digits groupped in pair). If you have an input of 122123123456.764244
, you expect 1,22,12,31,23,456.764244
and if you want to use a regex, this one does this:
\G\d{1,2}\K\B(?=(?:\d{2})*\d{3}(?!\d))
(The (?:\G|^)
in the link is only used because \G
matches only at the start of the string and after a match) and I don't think this could work without the positive lookahead, since it looks forward without moving the point of replacement.)
Trimming
Suppose you have:
this is a sentence
And want to trim all the spaces with a single regex. You might be tempted to do a general replace on spaces:
\s+
But this yields thisisasentence
. Well, maybe replace with a single space? It now yields " this is a sentence " (double quotes used because backticks eats spaces). Something you can however do is this:
^\s*|\s$|\s+(?=\s)
Which makes sure to leave one space behind so that you can replace with nothing and get "this is a sentence".
Splitting
Well, somewhere else where positive lookarounds might be useful is where, say you have a string ABC12DE3456FGHI789
and want to get the letters+digits apart, that is you want to get ABC12
, DE3456
and FGHI789
. You can easily do use the regex:
(?<=[0-9])(?=[A-Z])
While if you use ([A-Z]+[0-9]+)
(i.e. the captured groups are put back in the resulting list/array/etc, you will be getting empty elements as well.
Note that this could be done with a match as well, with [A-Z]+[0-9]+
If I had to mention negative lookarounds, this post would have been even longer :)