4

PII (Personally Identifiable Information) should never be sent to Google Analytics: not only does it breach the GA Terms of Use, but you are also leaking sensitive user data. So how do you remove PII from the URL, such as query string params (email, userId, ...) or even parts of the location path, when using Google Tag Manager (GTM) and Google Analytics 4 (GA4)?

oguz ismail
petriq

1 Answer

7

Let's assume you already have a GA4 property set up and GTM installed on your page.

Start by creating a new tag for the GA4 configuration. As the Measurement ID I use a Lookup Table variable (perfect when you have multiple environments such as testing, staging, and production, each with a separate Measurement ID but all sharing the same GTM install script), but you can simply enter your G-XXXXXXXXX Measurement ID here. Then expand the Fields to Set section, add page_location as the Field Name, and click the lego button next to Value.
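If you have not used a Lookup Table variable before: it maps an input value (for example the page hostname) to an output value. The mapping is roughly what the sketch below does in plain JavaScript. The hostnames, the Measurement IDs, the fallback, and the name `measurementIdFor` are all made up for illustration; in GTM you would normally configure this in the Lookup Table UI rather than write code.

```javascript
// Hypothetical hostnames and Measurement IDs, for illustration only.
var MEASUREMENT_IDS = {
  'test.example.com': 'G-TEST000000',
  'staging.example.com': 'G-STAGE00000',
  'www.example.com': 'G-PROD000000'
};

// Assumed fallback so that unknown hosts never report into production.
var DEFAULT_MEASUREMENT_ID = 'G-TEST000000';

// Returns the Measurement ID configured for a hostname, or the fallback.
function measurementIdFor(hostname) {
  return Object.prototype.hasOwnProperty.call(MEASUREMENT_IDS, hostname)
    ? MEASUREMENT_IDS[hostname]
    : DEFAULT_MEASUREMENT_ID;
}
```

Falling back to a non-production ID is a design choice in this sketch, not something GTM does for you; a Lookup Table variable has its own optional default value setting.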

Click the + (plus) button in the upper right corner to add a new variable.

As the Variable Type, choose Custom JavaScript. In the upper left corner, enter a name for your new variable; I used Redacted Page Location.

Now we get to the actual PII removal. In the Custom JavaScript section, insert a JS function that returns the redacted URL. My function uses regular expressions to replace PII in the URL with placeholder text. The values I wanted to redact are the company, project, epic, and task IDs from the URL path, and userId from the query params.

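function() {
  // Start from the full current URL (path, query string, and hash included).
  var url = window.location.toString();
  // Each rule pairs a regex matching a sensitive value with its placeholder.
  var filter = [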
    {
      rx: /company\/\d+/g,
      replacement: 'company/REDACTED_COMPANY_ID'
    },
    {
      rx: /projects\/\d+/g,
      replacement: 'projects/REDACTED_PROJECT_ID'
    },
    {
      rx: /epics\/\d+/g,
      replacement: 'epics/REDACTED_EPIC_ID'
    },
    {
      rx: /tasks\/\d+/g,
      replacement: 'tasks/REDACTED_TASK_ID'
    },
    {
      rx: /userId=\d+/g,
      replacement: 'userId=REDACTED_USER_ID'
    },
  ];
  
  filter.forEach(function(item) {
    url = url.replace(item.rx, item.replacement);
  });
  
  return url;
}

Let's say the URL of my page is https://www.example.com/company/2247/projects/2114/epics/19258/tasks/19259?userId=1234567. This function redacts it to https://www.example.com/company/REDACTED_COMPANY_ID/projects/REDACTED_PROJECT_ID/epics/REDACTED_EPIC_ID/tasks/REDACTED_TASK_ID?userId=REDACTED_USER_ID.
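To try the rules outside GTM (in a browser console or Node, for example), the same logic can be wrapped in a named function that takes the URL as an argument. This is only a sketch: the names `RULES` and `redactUrl` are mine, and the last rule, for e-mail-like values, is an assumption added for illustration and not part of the variable I actually use. It is a rough pattern, so check it against your own URLs before relying on it.

```javascript
// Same rules as in the Custom JavaScript variable, plus one assumed e-mail rule.
var RULES = [
  { rx: /company\/\d+/g, replacement: 'company/REDACTED_COMPANY_ID' },
  { rx: /projects\/\d+/g, replacement: 'projects/REDACTED_PROJECT_ID' },
  { rx: /epics\/\d+/g, replacement: 'epics/REDACTED_EPIC_ID' },
  { rx: /tasks\/\d+/g, replacement: 'tasks/REDACTED_TASK_ID' },
  { rx: /userId=\d+/g, replacement: 'userId=REDACTED_USER_ID' },
  // Assumption: an e-mail-like value inside one path segment or query value,
  // with "@" either literal or URL-encoded as %40.
  { rx: /[^\/?&=#]+(?:@|%40)[^\/?&=#]+/g, replacement: 'REDACTED_EMAIL' }
];

// Applies every rule in order and returns the redacted URL.
function redactUrl(url) {
  RULES.forEach(function (rule) {
    url = url.replace(rule.rx, rule.replacement);
  });
  return url;
}

console.log(redactUrl('https://www.example.com/company/2247/projects/2114/epics/19258/tasks/19259?userId=1234567'));
console.log(redactUrl('https://www.example.com/invite?email=john.doe%40example.com&ref=mail'));
```

In GTM itself the function has to stay anonymous and read `window.location`, as in the variable above; the named version is only for testing the regular expressions.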

Select the newly added custom variable (its name should now appear in the Value field) and save your GA4 tag.

Now let's test it. Switch to Preview mode and open your website. In GA, head to the DebugView of your GA4 property, wait for page_view to pop up in the timeline (you may have to reload your page), click on it, and expand the page_location parameter. You should see your redacted URL.

That's all, enjoy!

petriq
  • What about excluding an e-mail? If I add this: { rx: /[^\/]{4}(@|%40)(?!sub\.domain\.com)[^\/]{4}/gi, replacement: 'SOME-EMAIL' }, will it be correct? – Marcin Milowski Aug 26 '21 at 13:17
  • Yes, with a correct regular expression it will work. Your regex does not seem to be OK, though. – petriq Aug 30 '21 at 09:43
  • 1
    This does not seem to work for me, on a single page application. The "Redacted Page Location" function is executed (multiple times) for every event, verified by including a document.write(), but the "page_location" field is only set on the first page load and stays constant for every pageview after that. – Peter Dec 07 '21 at 14:56
  • Does your SPA use URL rewriting or hash (#) routing? You may have to send the pageView event manually every time your SPA route changes. – petriq Dec 12 '21 at 22:00
  • @petriq This was very helpful - thanks. I saw elsewhere that it is recommended to send this same redacted URL with any tags that set events. Do you or anyone else have any insight on how to determine whether that is needed? My event tags don't seem to contain the URL in most cases. – Simon_Weaver Sep 25 '22 at 00:16
  • 1
    @Simon_Weaver This is global GA4 configuration tag, so if you've got separate custom-fired event tags and these tags use this configuration tag, you're safe. – petriq Sep 25 '22 at 19:20
  • After some testing, it seems that all events (including automatically collected ones) inherit the `page_location` value that was current at the moment of the first page load. So if you have an SPA this approach won't work, because you can't redefine every single automatically collected event in GTM. Any ideas how to approach this problem in an SPA? It seems to be a pretty generic issue, but I can't find any useful guidance. – Ivan Samovar Nov 09 '22 at 00:26
  • @IvanSamovar It sounds like your website rewrites browser history with replaceState() or uses hash (#) routing. I think you need to push a pageView event into the dataLayer variable manually in your website code each time the URL changes. – petriq Nov 09 '22 at 10:42
  • Yep, that's the case, we use `pushState()`. A manual page view event will solve the issue with the pages/screens report; however, all other automatically collected events (scrolls, file downloads, site search, etc.) will still have the wrong `page_location` (either the one from the first page load, if you follow this answer, or the original URLs with PII). – Ivan Samovar Nov 09 '22 at 14:29
  • OK, it seems the solution is to add an additional "History Change" trigger to the GA4 configuration tag itself in GTM and to disable "Page changes based on browser history events" in the GA Enhanced measurement setup. – Ivan Samovar Nov 09 '22 at 16:35
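
For the manual approach suggested in the comments above (pushing a page view into the dataLayer on every SPA route change), a minimal sketch might look like the following. The function name `pushVirtualPageView`, the event name `virtual_page_view`, and the `page_location` key are assumptions for illustration; in GTM you would still need a Custom Event trigger for that event name and a Data Layer Variable to read the value. It is an alternative to the "History Change" trigger setup from the last comment, not a replacement for it.

```javascript
// Pushes a page view with an already-redacted URL into the given dataLayer.
// `redact` is any function that takes a URL string and returns the redacted one.
function pushVirtualPageView(dataLayer, url, redact) {
  dataLayer.push({
    event: 'virtual_page_view', // hypothetical custom event name
    page_location: redact(url)
  });
}

// Example with a stand-in array and a single redaction rule.
// In the browser you would pass window.dataLayer and window.location.toString()
// and call this from your router after each navigation.
var dataLayer = [];
pushVirtualPageView(dataLayer, 'https://www.example.com/tasks/19259', function (url) {
  return url.replace(/tasks\/\d+/g, 'tasks/REDACTED_TASK_ID');
});
```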