
Can I 'ignore' query string variables before pulling matching objects from the cache, but not actually remove them from the URL to the end-user?

For example, all the marketing utm_source, utm_campaign, utm_* values don't change the content of the page, they just vary a lot from campaign to campaign and are used by all of our client-side tracking.

So this also means that the URL can't change on the client side, but it should somehow be 'normalized' in the cache.

Essentially I want all of these...

http://example.com/page/?utm_source=google

http://example.com/page/?utm_source=facebook&utm_content=123

http://example.com/page/?utm_campaign=usa

... to all HIT the cache for http://example.com/page/

However, this URL should not map to the bare page, because variation is not a utm_* param:

http://example.com/page/?utm_source=google&variation=5

It should instead use the cache entry for

http://example.com/page/?variation=5

Also, keeping in mind that the URL the user sees must remain the same, I can't redirect to something without params or any kind of solution like that.

Tallboy

2 Answers


Disclaimer: these regexes probably aren't perfect, but they should work pretty well:

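sub vcl_recv {
  # Strip utm_* parameters at the start of the query string.
  set req.url = regsuball(req.url, "\?(utm_[^=&]*=[^&=]*&?)+", "?");
  # Strip utm_* parameters anywhere after the first parameter.
  set req.url = regsuball(req.url, "&(utm_[^=&]*=[^&=]*(&|$))+", "\2");
  # Drop a trailing "?" if no parameters remain.
  set req.url = regsub(req.url, "\?$", "");

  # Note: pass hands the request to the backend without a cache lookup.
  return (pass);
}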

This should remove any query parameters starting with utm_. I used three regexes to make it clearer and easier to read.

The first regsuball removes any utm_ parameters at the beginning of the query string. It looks for one or more utm_ parameters immediately after the ?. The second regsuball removes any utm_ parameters that aren't at the beginning of the query string.

The third regex cleans up the URL by removing the ? if there are no query parameters left once the utm_ parameters are gone.

Both regsuball patterns wrap the parameter match in ()+ so that a run of consecutive utm_ parameters is removed in one match. Without it, a utm_ parameter directly following another one would be left behind, because its leading & has already been consumed by the previous match.

Example results:

Source URL: /?utm_track=1&utm_test2=hey&test=utm_blah&utm_source=google&variation=5&utm_query=abc&utm_test7=yes
Maps to:    /?test=utm_blah&variation=5

Source URL: /?variation=5&utm_test1=abc&utm_test2=def&blah=1
Maps to:    /?variation=5&blah=1
Brandon
  • This `[^&=]` won't handle `utm_` parameters that have `=` signs in their values (imagine `?utm_track=foo=bar`), I believe. `=` signs are legal in the query portion of URIs even though web forms usually escape them. Someone (or an ad network) forging a UTM-laden URI might not. I would replace `[^&=]` with `[^&]`, since you want to remove the entire query param up until the next query param. – mogsie Dec 01 '15 at 09:24
  • The Varnish documentation [recommends against](https://book.varnish-software.com/3.0/VCL_Basics.html#default-vcl-recv) `pass`ing out of a user-defined `vcl_recv` function. Given the request lifecycle, the `pass` looks like it will also skip cache lookup--why not overwrite `req.url` and let the lookup continue like normal without returning? – jeebay May 16 '22 at 17:07
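
Taking the two comments above into account, a revised version might look like the sketch below. It has not been tested against a particular Varnish release. The changes are the ones the comments suggest: the value part uses `[^&]*` so that values containing `=` are removed whole, and there is no `return (pass)`, so processing falls through to the normal cache lookup.

```vcl
sub vcl_recv {
  # Strip utm_* parameters at the start of the query string.
  # The value part is [^&]* so "utm_track=foo=bar" is removed in full.
  set req.url = regsuball(req.url, "\?(utm_[^=&]*=[^&]*&?)+", "?");

  # Strip utm_* parameters that follow another parameter.
  # \2 keeps the trailing "&" when more parameters follow.
  set req.url = regsuball(req.url, "&(utm_[^=&]*=[^&]*(&|$))+", "\2");

  # Drop a trailing "?" if no parameters remain.
  set req.url = regsub(req.url, "\?$", "");

  # No return statement: the request continues to the cache lookup.
}
```

With this, `/page/?utm_track=foo=bar&variation=5` should be looked up as `/page/?variation=5`, and `/page/?utm_source=google` as `/page/`. Parameters written without an `=` (a bare `utm_source`) are not matched by these patterns.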

This did the trick... it's not perfect according to my own question though as it ignores ALL query params, not just utm ones. When I need to actually implement a non-utm value which changes the content I will need to revisit this regex:

sub vcl_recv {
    set req.url = regsub(req.url, "\?.*", "");
}
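
If the backend or the logs still need the original query string, one alternative is to leave `req.url` alone and normalize only the cache key in `vcl_hash`. This is a minimal sketch, assuming Varnish 4 or later (where `vcl_hash` returns `lookup`). It reuses the regex above and otherwise mirrors the built-in hashing of the host header or server IP.

```vcl
sub vcl_hash {
  # Hash the URL without its query string; req.url itself is unchanged,
  # so the backend still receives the full URL on a miss.
  hash_data(regsub(req.url, "\?.*", ""));

  # Same host handling as the built-in vcl_hash.
  if (req.http.host) {
    hash_data(req.http.host);
  } else {
    hash_data(server.ip);
  }

  # Returning here skips the built-in vcl_hash, which would otherwise
  # add the full req.url to the hash as well.
  return (lookup);
}
```

The regex inside `hash_data` can be swapped for a narrower one once a non-utm parameter needs to produce its own cache entry.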
Tallboy