
I have a ruby filter that is supposed to match an email address in a log message, pull it out, and pass it through an anonymization filter, something like this:

  ruby { 
  code =>
    "
    begin
      if !event['log_message'].nil?
        if match = event['log_message'].match(/(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)/i) 
          event['user_email'] = match[1]
        end
      else
        puts 'Oddity parsing message: log_message is nil'
        puts event.to_yaml
      end
    rescue Exception => e
      puts 'Exception parsing user email:'
      puts e.message
    end
    "
}
if [user_email] {
  anonymize {  
    algorithm => "SHA1"
    fields => ["user_email"]
    key => "mySuperSecretPassword"
  }
  ruby {
    code =>
      "
      begin
        event['message'].gsub!(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i, event['user_email'])
        event['log_message'].gsub!(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i, event['user_email'])
      rescue Exception => e
        puts 'Exception replacing user-email in log:'
        puts e.message
      end
      "
      remove_field => ["user_email"]              
  }
}

As of now, this regex isn't catching much of anything. When I tried replacing it with a different one, I got an error (my code ended up in the "Oddity parsing message" branch).

Does anyone know roughly how to do this? I don't need a crazy over-the-top regex, just one to catch 99% of email addresses. The regex I tried to use was:

if match = event['log_message'].match(/(\b[a-zA-Z0-9_.+=:-]+@[0-9A-Za-z][0-9A-Za-z-]{0,62}(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*\b)/i)

Here's a log line for reference:

76817815   11/Jun/2016 00:04:28 +0000  INFO  [eventListener-3] messagingsvc logDefault    > doSend - Sending email... From: "Test" <do-not-reply@test.com>

Note: if this can be done more easily or more sanely with Grok, I'm completely open to removing the ruby bit.

A_Elric
  • *I don't need a crazy over-the-top regex, just one to catch 99% of email addresses* makes this question a duplicate of http://stackoverflow.com/questions/14440444/extract-all-email-addresses-from-bulk-text-using-jquery, http://stackoverflow.com/questions/3194407/extract-all-email-addresses-from-some-txt-documents-using-ruby, and I guess many more. Why post just another question like "give me an email regex"? – Wiktor Stribiżew Jul 25 '16 at 18:41
  • The specifics of getting it to fit into either a grok or a ruby filter for logstash is a bit different. Also, there's a decided lack of documentation around how to properly do this online – A_Elric Jul 25 '16 at 18:45
  • All those I linked to fit Oniguruma regex flavor. – Wiktor Stribiżew Jul 25 '16 at 18:50
  • Here is one from mine: http://stackoverflow.com/a/37963296/3832970 – Wiktor Stribiżew Jul 25 '16 at 18:56
  • @WiktorStribiżew - Unless you can show how any email regex is a standard that satisfies the email RFC then you can't be marking them as duplicates. –  Jul 25 '16 at 19:25

1 Answer

This is from the HTML5 spec:

 [a-zA-Z0-9.!#$%&'*+/=?^_\`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*
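One way to check the pattern against the sample log line from the question is a plain Ruby script, run outside Logstash. This is only a minimal sketch: the names `EMAIL_RE`, `LOG_LINE` and `extract_emails` are made up for illustration, and `#` and `/` are backslash-escaped purely so the pattern is safe inside a Ruby `/.../` literal.

```ruby
# HTML5-spec email pattern, left unanchored so it can find an address
# inside a longer log line. "#" and "/" are escaped for the Ruby literal.
EMAIL_RE = /[a-zA-Z0-9.!\#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*/

# Sample line taken from the question.
LOG_LINE = '76817815   11/Jun/2016 00:04:28 +0000  INFO  [eventListener-3] ' \
           'messagingsvc logDefault    > doSend - Sending email... ' \
           'From: "Test" <do-not-reply@test.com>'

# Hypothetical helper: return every address-like substring found in text.
# to_s turns a nil field into an empty string, so nil input yields [].
def extract_emails(text)
  text.to_s.scan(EMAIL_RE)
end

p extract_emails(LOG_LINE)
```

Because the pattern has no capture groups, `scan` returns the matched strings directly. For the sample line that should be the single address between the angle brackets.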

Expanded:

 [a-zA-Z0-9.!#$%&'*+/=?^_\`{|}~-]+ 
 @
 [a-zA-Z0-9] 
 (?: [a-zA-Z0-9-]{0,61} [a-zA-Z0-9] )?
 (?:
      \. [a-zA-Z0-9] 
      (?: [a-zA-Z0-9-]{0,61} [a-zA-Z0-9] )?
 )*
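The expanded layout can be used more or less as-is in Ruby with the `x` (extended) flag, which ignores whitespace in the pattern. The sketch below replaces every address in a string with a keyed SHA1 digest, which is roughly what the anonymize-then-gsub steps in the question are trying to do. `EMAIL_RE_X` and `anonymize_emails` are hypothetical names, and using HMAC-SHA1 here is an assumption made for illustration, not a claim about the exact output of the Logstash anonymize filter.

```ruby
require 'openssl'

# Same HTML5-spec pattern in extended (x) mode; whitespace is ignored.
# "#" and "/" are escaped so the Ruby regex literal parses safely.
EMAIL_RE_X = /
  [a-zA-Z0-9.!\#$%&'*+\/=?^_`{|}~-]+
  @
  [a-zA-Z0-9]
  (?: [a-zA-Z0-9-]{0,61} [a-zA-Z0-9] )?
  (?:
    \. [a-zA-Z0-9]
    (?: [a-zA-Z0-9-]{0,61} [a-zA-Z0-9] )?
  )*
/x

# Hypothetical helper: swap each address for a 40-character hex HMAC-SHA1
# digest computed with the given key. A nil input becomes an empty string.
def anonymize_emails(text, key)
  text.to_s.gsub(EMAIL_RE_X) do |email|
    OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA1'), key, email)
  end
end

line = 'doSend - Sending email... From: "Test" <do-not-reply@test.com>'
puts anonymize_emails(line, 'mySuperSecretPassword')
```

Doing the match and the replacement in one `gsub` avoids the intermediate `user_email` field, so there is nothing to clean up afterwards. Inside a Logstash ruby filter the same `gsub` would be applied to the event's field values rather than to a local string.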