Google Analytics Regex - Alternative to no negative lookahead

Question

Google Analytics no longer allows negative lookahead in its filters. This makes it very difficult to create a custom report that includes only the links I want.

The regex that includes negative lookahead that would work if it was enabled is:

test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

This matches:

test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23 
test.com/?ref=23&e=35

and does not match (as it should):

test.com/ambassadors
test.com/admin/?signup=true 
test.com/randomtext/

I would like to know how to adapt my regex so that it keeps the same matches without using negative lookahead.

Thank you!

Alan Moore · Accepted Answer · 2012-11-13T18:30:47.040

Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.

That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have had this problem; you should have been using $ all along.
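A quick way to check this claim is to run the question's regex twice, once ending in (?!.) and once ending in $, over strings that contain no newlines. The sketch below uses Python's re module as a stand-in; Google Analytics uses its own regex engine, so this only illustrates the equivalence and is not a test of GA itself. The redundant backslash before the underscore is dropped.

```python
import re

# Sample URLs taken from the question; none contains a newline.
urls = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php?ref=23",
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
]

# The question's regex without its final assertion.
body = r"test.com(\/\??index_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*"

with_lookahead = re.compile(body + r"(?!.)")  # original ending
with_dollar = re.compile(body + r"$")         # proposed replacement

results_lookahead = [bool(with_lookahead.search(u)) for u in urls]
results_dollar = [bool(with_dollar.search(u)) for u in urls]

# Both endings accept and reject exactly the same URLs.
print(results_lookahead == results_dollar)
```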

However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:

test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)

...which I'm pretty sure you don't want. :P
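To see the over-matching concretely, here is a minimal sketch in Python (assumed here only as a convenient regex engine, not the one GA uses). It runs the question's regex, with the redundant backslash before the underscore dropped, against the two junk strings above and prints what each capture group swallowed.

```python
import re

# The question's regex, unchanged apart from "\_" written as "_".
original = re.compile(
    r"test.com(\/\??index_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)"
)

junk = [
    "test.com/?^#(%)!*%supercalifragilisticexpialidocious",
    "test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)",
]

matches = [original.search(url) for url in junk]

for url, m in zip(junk, matches):
    # groups() shows which (.*) consumed the unwanted text.
    print(bool(m), m.groups() if m else None)
```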

Try this regex:

test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$

or more readably:

test\.com
(?:
  /
  (?:index_\w+\.php)?
  (?:
    \?ref=\d+
    (?:
      &e=\d+
    )?
  )?
)?
\s*$
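The spaced-out layout above can be used directly in engines that have a free-spacing mode. As a sketch, Python's re.VERBOSE flag is assumed here (GA itself may not offer an equivalent), and the pattern is checked against the sample URLs from the question.

```python
import re

# Same regex as above; re.VERBOSE makes the engine ignore the layout whitespace.
pattern = re.compile(
    r"""
    test\.com
    (?:
      /
      (?:index_\w+\.php)?
      (?:
        \?ref=\d+
        (?:
          &e=\d+
        )?
      )?
    )?
    \s*$
    """,
    re.VERBOSE,
)

should_match = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23 ",  # trailing space, as in the question
    "test.com/?ref=23&e=35",
]
should_not_match = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
]

matched = [bool(pattern.search(u)) for u in should_match]
rejected = [pattern.search(u) is None for u in should_not_match]
print(all(matched), all(rejected))
```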

For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) what parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after the domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:

^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
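A small sketch of how the path-only version behaves, again using Python as an assumed test bed. The inputs are the question's sample URLs with the domain removed; the empty string and the path with a trailing space are added here as extra edge cases and are not from the original post.

```python
import re

path_pattern = re.compile(r"^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$")

accepted_paths = [
    "/",
    "/index_fb2.php",
    "/index_fb2.php?ref=23",
    "/index_fb2.php?ref=23&e=35",
    "/?ref=23",
    "/?ref=23&e=35",
]
rejected_paths = [
    "/ambassadors",
    "/admin/?signup=true",
    "/randomtext/",
    "",           # a bare domain with no path: the leading "/" is required
    "/?ref=23 ",  # trailing whitespace is no longer tolerated
]

accepted = [bool(path_pattern.search(p)) for p in accepted_paths]
rejected = [path_pattern.search(p) is None for p in rejected_paths]
print(all(accepted), all(rejected))
```

Note the two behavioural differences from the longer regex: the leading slash is mandatory, and trailing spaces cause a mismatch.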

thank you very much for your detailed answer, however Google Analytics is showing zero matches for it. I also can't seem to get it working on the online regex checker: http://regexr.com?32pr7 — eiso, Nov 13 '12 at 17:58
In the tester you should either use my first regex or delete the `test\.com` from each of the URLs. You also need to turn on multiline mode and get rid of that space you added to the end of the regex. It still won't match the `test.com/?ref=23` line, because it too has a space at the end. (Is that valid in GA? I suspect not.) — Alan Moore, Nov 13 '12 at 18:29
I got it working! I guess the space at the end was the problem. Thank you very much! You have saved me a lot of time each week with this custom report in GA and I've learned a lot about regexes. — eiso, Nov 14 '12 at 12:41
Given that it's similar and probably something simple that I'm missing in my regex, I wonder if you know how to solve this? https://stackoverflow.com/q/58259878/470749 Thanks. — Ryan, Oct 06 '19 at 20:57

Martin Ender · Answer 2 · 2012-11-13T14:05:08.213

Firstly I think your regex needs some fixing. Let's look at what you have:

test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

The case where you use the optional ? at the start of index... is already taken care of by the second alternative:

test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

Now you probably only want the first (.*) to be allowed, if there actually was a literal ? before. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:

test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)

Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:

test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
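The effect of this step can be checked with a short sketch. Python's re module is assumed as the engine, and the two "unwanted" URLs are the ones named in the paragraph above.

```python
import re

# Before: unescaped "." in front of php, and ".*" for the file name.
loose = re.compile(r"test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)")
# After: literal "\." and "[^?]*" so the name cannot run past a "?".
tight = re.compile(r"test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)")

unwanted = [
    "test.com/index_fb2php",
    "test.com/index_fb2.html?someparam=php",
]
wanted = [
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
]

loose_on_unwanted = [bool(loose.search(u)) for u in unwanted]
tight_on_unwanted = [bool(tight.search(u)) for u in unwanted]
tight_on_wanted = [bool(tight.search(u)) for u in wanted]

print(loose_on_unwanted)  # the loose version accepts both unwanted URLs
print(tight_on_unwanted)  # the tight version rejects them
print(tight_on_wanted)    # and still accepts the intended ones
```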

Now the first and second and third option can be collapsed into one, if we make the file name optional, too:

test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)

Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)
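As a sanity check on the chain of simplifications, the sketch below (Python assumed as the engine) runs the final regex against the question's own lists of URLs that should and should not match.

```python
import re

simplified = re.compile(r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)")

should_match = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23 ",  # trailing space, as in the question
    "test.com/?ref=23&e=35",
]
should_not_match = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
]

matched = [bool(simplified.search(u)) for u in should_match]
rejected = [simplified.search(u) is None for u in should_not_match]
print(all(matched), all(rejected))
```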

Seeing your input examples, this seems to be closer to what you actually want to match.

Then to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, it asserts that you have reached the end of the string. In this case you can simply replace it by \Z, which matches at the end of the string regardless of mode (in many flavors it also matches just before a single trailing line break). If you do not, then it asserts that you have reached the end of a line. In this case you can use $ but you need to also use multi-line mode, so that $ matches line-endings, too.
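The mode dependence is easy to see on a tiny two-line string. This sketch assumes Python, where singleline mode is spelled re.DOTALL and multiline mode re.MULTILINE, and where \Z matches only at the very end of the string; other flavors may differ in detail.

```python
import re

text = "abc\nabc"  # "c" sits at index 2 (before the newline) and index 6 (end)

# Without singleline mode, "." cannot match "\n", so (?!.) also succeeds at a line end.
lookahead_default = [m.start() for m in re.finditer(r"c(?!.)", text)]
# With singleline mode, "." matches "\n", so (?!.) succeeds only at the string end.
lookahead_dotall = [m.start() for m in re.finditer(r"c(?!.)", text, re.DOTALL)]
# "$" in multiline mode behaves like the first case.
dollar_multiline = [m.start() for m in re.finditer(r"c$", text, re.MULTILINE)]
# "\Z" behaves like the second case.
end_of_string = [m.start() for m in re.finditer(r"c\Z", text)]

print(lookahead_default, dollar_multiline)  # [2, 6] [2, 6]
print(lookahead_dotall, end_of_string)      # [6] [6]
```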

So, if you use singleline mode (which probably means you have only one URL per string), use this:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z

If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$
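A rough sketch of the two variants on a string holding several URLs, one per line. Python is assumed, the three-line input is made up for illustration, and Python's \Z matches only at the absolute end of the string.

```python
import re

# Hypothetical input: three URLs on separate lines.
text = "test.com/\ntest.com/ambassadors\ntest.com/?ref=23"

# "$" with multiline mode: every line is a candidate.
per_line = re.compile(
    r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$", re.MULTILINE
)
# "\Z": only a URL that ends where the whole string ends can match.
whole_string = re.compile(r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z")

per_line_hits = [m.group(0) for m in per_line.finditer(text)]
whole_string_hits = [m.group(0) for m in whole_string.finditer(text)]

print(per_line_hits)      # ['test.com/', 'test.com/?ref=23']
print(whole_string_hits)  # ['test.com/?ref=23']
```

One caveat with multi-line input: [^?]* can match newlines, so an index_ file name could in principle span lines; the sample text avoids that case.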

Thank you very much for the answer. I have adapted the regex accordingly. And yes, it is singleline mode. Do you have any thoughts on how to change this so it does not need the negative lookahead? — eiso, Nov 13 '12 at 13:54
@eiso I also added in some explanation on how I got to the simplified version (and what my assumptions about your desired matches were) — Martin Ender, Nov 13 '12 at 13:54
Wow! Thank you very much for this, I hadn't seen that yet with my previous reply. — eiso, Nov 13 '12 at 13:55
There's always more to learn with Regular Expressions. Is there an approach that does not use the last part of a negative lookahead ((\s)*$) because Google Analytics does not accept this. — eiso, Nov 13 '12 at 13:58
@eiso `$` is not a negative lookahead. it's simply an anchor, and that should be fine. Try `\Z` instead of `$`. That is actually even better, since you don't need to worry about multiline mode. — Martin Ender, Nov 13 '12 at 13:59
@eiso I made another minor improvement to the regex and added both possible [anchors](http://www.regular-expressions.info/anchors.html) to the answer. — Martin Ender, Nov 13 '12 at 14:05
thank you again for all your help. Unfortunately Google Analytics is not parsing either, also I'm having a hard time getting it to work on a regex tester: http://regexr.com?32pr4 — eiso, Nov 13 '12 at 17:56
@eiso for gskinner, check both `extended` and `multiline` ... unfortunately I have no clue what might not be working with Google Analytics at the moment — Martin Ender, Nov 13 '12 at 18:05
