As part of our effort to avoid collecting unnecessary Personally Identifiable Information, I would like to prevent email addresses from being logged, and later extend this to other information.

As we want to treat logs as streams, the obvious solution seems to be using sed. However, I can't find any information on whether this is a good idea.

Presumably I can just pipe the service output through something like

#!/bin/sed -rf
# Obfuscate email addresses (e.g. username@email.com => ###@email.com).
s/[[:alnum:]_+%.-]+@/###@/

And not have to worry about email addresses finding their way into the logs.
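To make the intended behaviour concrete, here is a minimal sketch of the same substitution applied to a log line. The `mask_emails` function name and the sample JSON line are made up for illustration. It also assumes two changes to the script above: `-E` as the more portable spelling of `-r`, and a trailing `g` flag, since without it sed replaces only the first match on each line and a second address in the same JSON record would pass through untouched.

```shell
# Hypothetical helper: mask the local part of anything that looks like an email address.
# -E enables extended regexes (same as GNU sed's -r); the g flag masks every match on a line.
mask_emails() {
  sed -E 's/[[:alnum:]_+%.-]+@/###@/g'
}

# Made-up NDJSON line containing two addresses.
printf '%s\n' '{"msg":"login","user":"alice@example.com","cc":"bob.smith+x@example.org"}' | mask_emails
# prints {"msg":"login","user":"###@example.com","cc":"###@example.org"}
```

When the input is a long-running stream rather than a file, output buffering may delay log lines; with GNU sed, the `-u` (`--unbuffered`) option is probably worth testing in that setup.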

  1. Would this have some obvious negative consequence that I am missing? (Logs are newline-delimited JSON, with an average length of 800 characters.)
  2. Is there some standard way I could configure this at the Kubernetes node/cluster level?

(I understand that this email regex is not comprehensive; the assumption would remain that the logs need to be treated securely. I'm looking more for a belt-and-braces approach.)

Max
  • `the obvious solution seems to be using sed`: No, this is not an obvious solution. What sort of logging facade are you using? Is it `slf4j`? – anubhava Nov 16 '20 at 14:40
  • Just an email terminus as commonly used on web forms can probably be matched just fine with a regex; the hard parts are in the decorations allowed around the address by RFC5322. There are also corner cases which are not in common use, like "quoted string"@example.com but if you are just cleaning out logs, you are probably fine with less than 100% accuracy. – tripleee Nov 16 '20 at 14:47
  • You probably want `#!/bin/sed -rf` where the `!` is crucial. (Less crucially, check the spelling of "obfuscate".) – tripleee Nov 16 '20 at 14:48
  • A trickier problem is people who fill in all kinds of non-email information because they can't read forms. What should you do about the email address `the body is buried on lot 123 on main street`? – tripleee Nov 16 '20 at 14:52
  • @anubhava, sed was obvious to me as I am dealing with the log stream. This isn't something I want littered through my application code. – Max Nov 16 '20 at 14:52
  • @tripleee that wouldn't be an issue because it is not personally identifiable or relating to living persons and so outside the remit of the GDPR – Max Nov 16 '20 at 14:55
  • A few remarks: within `[...]`, you don't have to escape, so backslashes are treated literally; the `-` should be the first or last character in the bracket expression; you could use `[[:alnum:]]` instead of `[A-Za-z0-9]`; you seem to include a literal `+` right before the `@`, is that intentional? – Benjamin W. Nov 16 '20 at 16:15
  • Thanks @BenjaminW. for the sed regex tips. Updated. – Max Nov 17 '20 at 10:54

0 Answers