My goal is to "whitelist" certain query string parameters and their values so that Varnish does not vary the cache between the URLs.

Example:

Url 1: http://foo.com/someproduct.html?utm_code=google&type=hello  
Url 2: http://foo.com/someproduct.html?utm_code=yahoo&type=hello  
Url 3: http://foo.com/someproduct.html?utm_code=yahoo&type=goodbye

In the above example I want to whitelist "utm_code" but not "type". So after the first URL is hit, I want Varnish to serve that cached content for the second URL.

However, in the case of the third URL, the value of the "type" parameter is different, so that should be a Varnish cache miss.

I have tried the two methods below (found in a Drupal help article I can't locate right now), but they did not seem to work. It might be because I have the regex wrong.

# 1. strip out certain querystring values so varnish does not vary cache.
set req.url = regsuball(req.url, "([\?|&])utm_(campaign|content|medium|source|term)=[^&\s]*&?", "\1");
# get rid of trailing & or ?
set req.url = regsuball(req.url, "[\?|&]+$", "");

# 2. strip out certain querystring values so varnish does not vary cache.
set req.url = regsuball(req.url, "([\?|&])utm_campaign=[^&\s]*&?", "\1");
set req.url = regsuball(req.url, "([\?|&])foo_bar=[^&\s]*&?", "\1");
set req.url = regsuball(req.url, "([\?|&])bar_baz=[^&\s]*&?", "\1");
# get rid of trailing & or ?
set req.url = regsuball(req.url, "[\?|&]+$", "");
Warren P
runamok

7 Answers


I figured this out and wanted to share. I found this code, which defines a subroutine that does what I need.

sub vcl_recv {

    # strip out certain querystring params that varnish should not vary cache by
    call normalize_req_url;

    # snip a bunch of other code
}

sub normalize_req_url {

    # Strip out Google Analytics campaign variables. They are only needed
    # by the javascript running on the page
    # utm_source, utm_medium, utm_campaign, gclid, ...
    if(req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
        set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=[%.-_A-z0-9]+&?", "");
    }
    set req.url = regsub(req.url, "(\?&?)$", "");
}
runamok

There's something wrong with the regex in runamok's answer.
I changed the regexes used in both the regsuball and the regsub call:

sub normalize_req_url {
    # Clean up root URL
    if (req.url ~ "^/(?:\?.*)?$") {
        set req.url = "/";
    }

    # Strip out Google Analytics campaign variables
    # They are only needed by the javascript running on the page
    # utm_source, utm_medium, utm_campaign, gclid, ...
    if (req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
        set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=[%\._A-z0-9-]+&?", "");
    }
    set req.url = regsub(req.url, "(\?&|\?|&)$", "");
}

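The first change is the character class "[%\._A-z0-9-]": in the original "[%.-_A-z0-9]" the dash acted as a range operator (from "." to "_"), so I moved it to the end, where it is a literal dash. I also escaped the dot, although inside a character class that is not strictly necessary.

The second change is to remove not only a trailing question mark from the remaining URL, but also a trailing ampersand, or a trailing question mark followed by an ampersand.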
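
To make the range problem concrete, here is a minimal sketch, assuming Varnish 4 syntax. The backend address and the subroutine name strip_utm_source are placeholders made up for this illustration; only the two character classes come from this thread.

```vcl
vcl 4.0;

# Placeholder backend so the file compiles on its own (host and port are assumptions).
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Hypothetical helper that strips a single parameter to isolate the character class.
sub strip_utm_source {
    if (req.url ~ "(\?|&)utm_source=") {
        # In "[%.-_A-z0-9]" the sequence ".-_" is a range from "." (0x2E) to "_" (0x5F).
        # It matches "/", ":", "=" and "?", but not a literal "-" (0x2D), so
        # "/p?utm_source=summer-sale&type=hello" would become "/p?-sale&type=hello".
        # With the dash last in the class it is a literal dash and the whole value matches:
        set req.url = regsuball(req.url, "utm_source=[%._A-z0-9-]+&?", "");
    }
    # Remove a dangling "?&", "?" or "&" at the end of the URL.
    set req.url = regsub(req.url, "(\?&|\?|&)$", "");
}

sub vcl_recv {
    call strip_utm_source;
}
```

With the dash moved to the end, "/p?utm_source=summer-sale&type=hello" should normalize to "/p?type=hello".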

kipusoep

From https://github.com/mattiasgeniar/varnish-4.0-configuration-templates:

# Some generic URL manipulation, useful for all templates that follow
# First remove the Google Analytics added parameters, useless for our backend
if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
  set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
  set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
  set req.url = regsub(req.url, "\?&", "?");
  set req.url = regsub(req.url, "\?$", "");
}
Pere
  • For the older 3.x version, the syntax is the same: https://github.com/mattiasgeniar/varnish-3.0-configuration-templates/blob/master/production.vcl#L70 – Pere Dec 03 '15 at 08:39

You want to strip out utm_code, but it's not covered by either of the regexps you are using.

Try this:

# Strip out specific utm_ values from request URL query parameters
set req.url = regsuball(req.url, "([\?|&])utm_(campaign|content|medium|source|term|code)=[^&\s]*&?", "\1");
# get rid of trailing & or ?
set req.url = regsuball(req.url, "[\?|&]+$", "");

Or if you want to strip all URL parameters that start with utm_ you can go with:

# Strip out ALL utm_ values from request URL query parameters
set req.url = regsuball(req.url, "([\?|&])utm_(\w+)=[^&\s]*&?", "\1");
# get rid of trailing & or ?
set req.url = regsuball(req.url, "[\?|&]+$", "");
Ketola
  • I'm sorry, I meant to explain that my code did not seem to work for utm_campaign, utm_content, etc. utm_code was just a "generic example" I made up. I did eventually find something that worked though and will add it to the original edit... Thanks for your input though! – runamok Dec 14 '12 at 02:14
  • Actually you almost had it, but it fails when you have trailing utm_ parameters, since the greedy & at the end of the match causes the next one not to match. You need: ([\?|&])utm_(\w+)=[^&\s]* – dalore Apr 23 '15 at 14:55
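
The fix suggested in the last comment can be sketched as follows, assuming Varnish 4 syntax. The subroutine name strip_utm_params, the backend address, and the two separator cleanup steps in the middle are assumptions added for illustration; they are not from the thread.

```vcl
vcl 4.0;

# Placeholder backend so the file compiles on its own (host and port are assumptions).
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Hypothetical helper built around the pattern from the comment.
sub strip_utm_params {
    # No trailing "&?" here: each match consumes only its own leading "?" or "&",
    # so the separator of the following utm_ parameter is still available to match.
    set req.url = regsuball(req.url, "([\?|&])utm_(\w+)=[^&\s]*", "\1");
    # Assumed extra cleanup: collapse runs of "&" left behind by removed parameters.
    set req.url = regsuball(req.url, "&&+", "&");
    # Assumed extra cleanup: drop an "&" that now directly follows the "?".
    set req.url = regsub(req.url, "\?&", "?");
    # Get rid of a trailing "&" or "?" (same step as in the answer).
    set req.url = regsuball(req.url, "[\?|&]+$", "");
}

sub vcl_recv {
    call strip_utm_params;
}
```

Tracing "/p?utm_source=a&utm_medium=b&type=hello" by hand: the first substitution leaves "/p?&&type=hello", and the cleanup steps reduce that to "/p?type=hello".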

A copy of runamok's answer, but I got + instead of %20 in my params, so I have added that to my regex:

sub vcl_recv {
    # strip out certain querystring params that varnish should not vary cache by
    call normalize_req_url;
    # snip a bunch of other code
}
sub normalize_req_url {
    # Strip out Google Analytics campaign variables.
    # I also strip fb_local, which is used by the Facebook JavaScript.
    # They are only needed by the JavaScript running on the page
    # utm_source, utm_medium, utm_campaign, gclid, ...
    if(req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|fb_local|mr:[A-z]+)=") {
        set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|fb_local|mr:[A-z]+)=[%.+-_A-z0-9]+&?", "");
    }
    set req.url = regsub(req.url, "(\?&?)$", "");
}
Evaldnet

Have you guys given this a try? https://github.com/Dridi/libvmod-querystring

Example
set req.url = querystring.regfilter(req.url, "utm_.*");

user2965205

I improved upon runamok's answer a bit by adding support for empty params and sorting the remaining ones. Here's a full vtc file that I implemented to validate correctness.

varnishtest "Test for URL normalization - Varnish 4"

server s1 {
  rxreq
  txresp -hdr "Backend: up" -body "Some content"
} -repeat 11 -start

varnish v1 -vcl+backend {
  import std;

  sub vcl_recv {
    # Strip out marketing variables. They are only needed by
    # the javascript running on the page.
    if (req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)(=|&|$)") {
      # Process params with value.
      set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=[%.\-_A-z0-9]+&?", "");
      # Process params without value.
      set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=?(&|$)", "");
    }
    # Remove trailing '?', '?&'
    set req.url = regsub(req.url, "(\?&?)$", "");
    # Sort query params, also removes trailing '&'
    set req.url = std.querysort(req.url);
  }

  sub vcl_deliver {
    set resp.http.X-Normalized-URL = req.url;
  }
} -start

client c1 {
  # Basic, no params.
  txreq -url "/test/some-url"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url"

  # One blacklisted param.
  txreq -url "/test/some-url?utm_campaign=1"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url"

  # One blacklisted param, without value.
  txreq -url "/test/some-url?utm_campaign"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url"

  # Two blacklisted params.
  txreq -url "/test/some-url?utm_campaign=1&origin=hpg"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url"

  # Two blacklisted params, one without value
  txreq -url "/test/some-url?utm_campaign&origin=123-abc%20"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url"

  # Two blacklisted params, both without value
  txreq -url "/test/some-url?utm_campaign&origin="
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url"

  # Three blacklisted params.
  txreq -url "/test/some-url?utm_campaign=ABC&origin=hpg&siteurl=br2"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url"

  # Three blacklisted params, two without value
  txreq -url "/test/some-url?utm_campaign=1&origin=&siteurl"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url"

  # Three blacklisted params; one param to keep, with space encoded as +.
  txreq -url "/test/some-url?qss=hello+one&utm_campaign=some-value&origin=hpg&siteurl=br2"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url?qss=hello+one"

  # Three blacklisted params; one param to keep, with space encoded as %20, passed in-between blacklisted ones.
  txreq -url "/test/some-url?utm_campaign=1&qss=hello%20one&origin=hpg&siteurl=br2"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url?qss=hello%20one"

  # Three blacklisted params; three params to keep.
  txreq -url "/test/some-url?utm_campaign=a-value&qss=hello+one&origin=hpg&siteurl=br2&keep2=abc&keep1"
  rxresp
  expect resp.http.X-Normalized-URL == "/test/some-url?keep1&keep2=abc&qss=hello+one"
} -run

varnish v1 -expect client_req == 11
Jedihe