
I'm writing a Python script that replaces email addresses and IP addresses in a log file with random-looking characters (truncated hashes).

Right now each line is processed twice: the email address substitution runs first, and then the IP address substitution runs on the result.

I want to do both in one go, but cannot find a way to do so.

I read the post here, but that seems to work only for simple substitutions. In this case, I'm substituting a (relatively) complex regex with a function, and can't figure out a way to do it in one line.

This is the original code:

import re
import hashlib


def hashing_func(to_hash):
    ''' Calculates hash '''
    # sha256 needs bytes on Python 3, so encode the text first.
    return hashlib.sha256(to_hash.encode('utf-8')).hexdigest()


def hashing_func_email(username, domain):
    ''' Creates a separate hash for username and
        domain. Appends 'EM_' to show it's an email.
        Reduces hash length to make it more readable.'''
    username_hash = hashing_func(username)
    domain_hash = hashing_func(domain)
    return 'EM_' + username_hash[:13] + '@' + domain_hash[:10]


def hashing_func_ipaddr(ipaddr):
    ''' Creates a hash for IP address, and appends 'IP_'.
        Reduces length to make it more readable.'''
    ipaddr_hash = hashing_func(ipaddr)
    return 'IP_' + ipaddr_hash[:11]


def main():
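    email_regex = re.compile(r'''
        ([a-zA-Z0-9._+-]+)        # group 1: username
        (@|%40)                   # group 2: literal @ or URL-encoded %40
        ([a-zA-Z0-9.-]+           # group 3: full domain
        (\.[a-zA-Z0-9]{2,4})      # group 4: top-level domain
        )''', re.VERBOSE)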

    ipaddr_regex = re.compile(r'''\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                              (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
                              ''', re.VERBOSE)

    old_file = ["10.38.1.2, user@example.com, was here",
                "This user2@example.com tried from 84.12.41.53 again"]

    for line in old_file:
        new_line = re.sub(
            email_regex, lambda x: hashing_func_email(
                                x.group(1), x.group(3)), line)
        new_line_ip = re.sub(
            ipaddr_regex, lambda x: hashing_func_ipaddr(
                                         x.group(0)), new_line)

        print(new_line_ip)

if __name__ == '__main__':
    main()
ShanxT

1 Answer


You can do something like this, I believe.

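for line in old_file:
    # The inner re.sub handles IP addresses; its result is the input
    # string for the outer re.sub, which handles email addresses.
    new_line = re.sub(
        email_regex,
        lambda x: hashing_func_email(x.group(1), x.group(3)),
        re.sub(ipaddr_regex, lambda x: hashing_func_ipaddr(x.group(0)), line))

    print(new_line)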
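
The nesting works because the inner `re.sub` returns a string, which becomes the input of the outer call. With more than two patterns the same chaining may read better as a loop. A minimal sketch, assuming Python 3; the patterns are simplified stand-ins for the ones in the question, and `short_hash`, `SUBSTITUTIONS` and `anonymize` are hypothetical names that are not in the original code:

```python
import hashlib
import re

# Simplified stand-in patterns for illustration; swap in the full regexes from the question.
EMAIL_RE = re.compile(r'([a-zA-Z0-9._+-]+)(@|%40)([a-zA-Z0-9.-]+\.[a-zA-Z0-9]{2,4})')
IPADDR_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def short_hash(text, length):
    """Return the first `length` hex characters of the SHA-256 of `text`."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()[:length]


# Hypothetical table of (pattern, replacement callback) pairs, applied in order.
SUBSTITUTIONS = [
    (IPADDR_RE, lambda m: 'IP_' + short_hash(m.group(0), 11)),
    (EMAIL_RE, lambda m: 'EM_' + short_hash(m.group(1), 13)
                         + '@' + short_hash(m.group(3), 10)),
]


def anonymize(line):
    """Apply every substitution in turn; each one receives the previous result."""
    for pattern, repl in SUBSTITUTIONS:
        line = pattern.sub(repl, line)
    return line


print(anonymize("10.38.1.2, user@example.com, was here"))
```

Adding another pattern then only means appending one more pair to the list. The line is still scanned once per pattern, so this is a tidier form of chaining rather than a true single pass.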

EDIT
Don't reinvent the wheel; check these out:
How can I do multiple substitutions using regex in python?
Python Regex sub() with multiple patterns
https://www.safaribooksonline.com/library/view/python-cookbook-2nd/0596007973/ch01s19.html
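
For patterns that need a function as the replacement, one way to get a real single pass is to join the patterns with `|` and let one callback decide which alternative matched. A minimal sketch, assuming Python 3; the group names `email`, `user`, `domain` and `ip`, and the helpers `short_hash`, `replace` and `anonymize`, are hypothetical and not part of the original code:

```python
import hashlib
import re

# Both patterns joined with "|" so a single scan handles either kind of match.
# Named groups let the callback tell which alternative fired.
COMBINED_RE = re.compile(r'''
    (?P<email>
        (?P<user>[a-zA-Z0-9._+-]+)
        (?:@|%40)
        (?P<domain>[a-zA-Z0-9.-]+\.[a-zA-Z0-9]{2,4})
    )
    |
    (?P<ip>
        \b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
        (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
    )
''', re.VERBOSE)


def short_hash(text, length):
    """Return the first `length` hex characters of the SHA-256 of `text`."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()[:length]


def replace(match):
    """Build the replacement for whichever alternative matched."""
    if match.group('email') is not None:
        return ('EM_' + short_hash(match.group('user'), 13)
                + '@' + short_hash(match.group('domain'), 10))
    return 'IP_' + short_hash(match.group('ip'), 11)


def anonymize(line):
    """Replace emails and IP addresses in one pass over the line."""
    return COMBINED_RE.sub(replace, line)


for line in ["10.38.1.2, user@example.com, was here",
             "This user2@example.com tried from 84.12.41.53 again"]:
    print(anonymize(line))
```

Named groups are used because combining patterns shifts the numbered groups, so `x.group(1)` and `x.group(3)` from the original lambdas would no longer be reliable.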

Colonder
  • Indentation looks awkward in the Stack code snippet, though. – bhansa Sep 09 '17 at 20:00
  • I guess it's just my laziness. I will try to beautify it. – Colonder Sep 09 '17 at 20:01
  • Please don't forget to accept the answer when you're able to :) – Colonder Sep 09 '17 at 20:04
  • Wow thank you so much! This is perfect! If I understand correctly, re.sub(ipaddr_regex) is an input to the re.sub(email_regex), which then applies both to the 'line'. Is that correct? I've also got a couple of other regexes I want to run along with these two, and simply tacking them on just works. – ShanxT Sep 09 '17 at 20:22
  • I had seen those links before posting. The problem is that they don't seem to accept complex regexes. So direct word substitution works, but if I use something like [abcd], it'll either ignore it or will give an error. For example, in the safaribooksonline link, I added this to the dictionary: `"[is]" : "x"` The expected result would be "Guido van Rossum xx the..", but it just ignored it. – ShanxT Sep 10 '17 at 01:35
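
A likely explanation for the ignored `"[is]"` key, assuming the linked recipe builds its pattern by passing every dictionary key through `re.escape` (as the commonly quoted version of that recipe does). The function below is a reconstruction for illustration, not a quote from the book:

```python
import re


def multiple_replace(text, adict):
    """Literal multi-string replacement in the style of the cookbook recipe."""
    # re.escape turns every key into a literal, so "[is]" only matches the
    # four characters "[is]" and never the character class "i or s".
    rx = re.compile('|'.join(map(re.escape, adict)))
    return rx.sub(lambda m: adict[m.group(0)], text)


# The key is treated literally, so nothing in this sentence matches.
print(multiple_replace("Guido van Rossum is the creator", {"[is]": "x"}))

# Only the literal text "[is]" is replaced.
print(multiple_replace("a [is] b", {"[is]": "x"}))
```

Removing `re.escape` alone would not help: the callback looks up the matched text (for example `"i"`) in the dictionary, finds no such key, and raises `KeyError`, which would fit the error mentioned above.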