I'm slowly moving from Perl to Python, and trying to understand the best practices for using regular expressions.

I have the following Perl code. It takes a string as input and prints a rearranged string as output, based on regex match and capture:

#!/usr/bin/env perl

use strict;
use warnings;

my $str = $ARGV[0] || die "Arg?";

my $result;

if($str =~ m/^\d{12}$/) {
    $result = $str;
} elsif($str =~ m{^(\d{2})/(\d{2})/(\d{4})$}) {
    $result = "${1}${2}0000${3}";
} elsif($str =~ m{^(\d{4})$}) {
    $result = "01010000${1}";
} else {
    die "Invalid string";
}

print("Result: $result\n");

What would be a good equivalent in Python 3?

I came up with the following so far, but it seems inefficient to match twice in the elif part. It also seems inefficient to compile all the regular expressions at the start.

#!/usr/bin/env python3

import re, sys

str = sys.argv[1]

p1 = re.compile(r'\d{12}$')
p2 = re.compile(r'(\d{2})/(\d{2})/(\d{4})$')
p3 = re.compile(r'(\d{4})$')

if p1.match(str):
    result = str
elif p2.match(str):
    m = p2.match(str)
    result = '%s%s0000%s' % (m.group(1), m.group(2), m.group(3))
elif p3.match(str):
    m = p3.match(str)
    result = '01010000%s' % (m.group(1))
else:
    raise Exception('Invalid string')

print('Result: ' + result)

Given Python's motto of "There should be one-- and preferably only one --obvious way to do it" - any ideas / suggestions of what the one best way here would be?

Thank you in advance for any suggestions.

Best regards, -Pavel

Pavel Chernikov

2 Answers

A few notes about your code:

  1. Precompiled regexes
    There is no need to compile regexes explicitly if you aren't planning to reuse them. By using module level functions you get cleaner code:
    use m = re.match(pattern, text)
    instead of p1 = re.compile(pattern) followed by m = p1.match(str)

  2. Try to match, if matched - format output using matched groups
    Python's re module provides a function that fits your case well: re.subn(). It performs the regex replacement and returns a tuple of the new string and the number of replacements made.

  3. Performance considerations

    • re.match() called twice - it will attempt to match the same line twice and return two different match objects. It will likely cost you some additional cycles.
    • re.compile() (or module level matching function) called twice - it's OK according to the docs:

      Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

    • How to avoid regex precompilation
      The code defines the order in which the regexes are tried against the input string. It makes sense to compile a regex only when we actually reach it, that is, after every earlier pattern has failed to match. See the code below; it's much simpler than the explanation.
    • Premature optimization
      You're not experiencing any performance problems, are you? By optimizing this early you risk spending time without any observable effect.
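The re.subn() behaviour from point 2 can be checked with a short sketch. The name date_pattern is only illustrative, and the pattern is anchored with ^ and $ on the assumption that the whole string must match, as in the Perl original:

```python
import re

# Illustrative pattern for the dd/dd/dddd form, anchored to the whole string.
date_pattern = r'^(\d{2})/(\d{2})/(\d{4})$'

# re.subn returns a (new_string, count) tuple.
# \g<2> keeps the group reference separate from the literal zeros after it.
print(re.subn(date_pattern, r'\1\g<2>0000\3', '12/34/5678'))
# → ('123400005678', 1)

# When nothing matches, the input comes back unchanged with a count of 0.
print(re.subn(date_pattern, r'\1\g<2>0000\3', 'no date here'))
# → ('no date here', 0)
```

A zero count is what lets a single call serve as both the "did it match?" test and the formatting step.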

The Motto:

import re

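# ^ and $ anchor each pattern to the whole string, as in the Perl original
rules = ( (r'^\d{12}$', r'\g<0>')
        , (r'^(\d{2})/(\d{2})/(\d{4})$', r'\1\g<2>0000\3')
        # a bare \2 followed by 0000 would not be read as group 2 plus literal zeros, hence \g<2>
        , (r'^(\d{4})$', r'01010000\1') )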

def transform(text):
    for regex, repl in rules:
        # we're compiling only those regexes we really need
        result, n = re.subn(regex, repl, text)
        if n: return result
    raise ValueError('Invalid string')

tests = ['1234', r'12/34/5678', '123456789012']
for test in tests:
    print(transform(test))

transform('this line supposed to trigger exception')

Hope this helped

Igor Korzhanov

If you're absolutely determined not to perform the same regex match twice, you can do this:

import re

p1 = re.compile(r'\d{12}$')
p2 = re.compile(r'(\d{2})/(\d{2})/(\d{4})$')
p3 = re.compile(r'(\d{4})$')

# Functions to perform the processing steps required for each
# match; you might be able to save some lines of code by making
# these lambdas
def result1(s, m):
    return s

def result2(s, m):
    return '%s%s0000%s' % (m.group(1), m.group(2), m.group(3))

def result3(s, m):
    return '01010000%s' % (m.group(1))

# str is the input string from the question's script (sys.argv[1])
for pattern, result_getter in [(p1, result1), (p2, result2), (p3, result3)]:
    m = pattern.match(str)
    if m:
        result = result_getter(str, m)
        break
else:
    # no pattern matched, so result was never assigned
    raise ValueError('Invalid string')

print('Result: ' + result)

Personally I think this level of micro-optimization won't make much difference, but there is a way to get it done.
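For comparison, here is a minimal sketch of another way to match each pattern only once. It assumes Python 3.8 or later, where an assignment expression (the := operator) can bind the match object inside the condition. The function name normalize is hypothetical, and re.fullmatch() stands in for the ^...$ anchors of the Perl version (unlike Perl's $, it does not tolerate a trailing newline):

```python
import re

def normalize(text):
    """Return the 12-digit form of text, trying each pattern at most once."""
    # Already in the target form: return it unchanged.
    if re.fullmatch(r'\d{12}', text):
        return text
    # := binds the match object and tests it in the same expression (3.8+).
    if m := re.fullmatch(r'(\d{2})/(\d{2})/(\d{4})', text):
        return '{}{}0000{}'.format(*m.groups())
    if m := re.fullmatch(r'\d{4}', text):
        return '01010000' + m.group(0)
    raise ValueError('Invalid string: {!r}'.format(text))

print(normalize('12/34/5678'))  # → 123400005678
print(normalize('1234'))        # → 010100001234
```

This keeps the if/elif shape of the Perl original without the helper functions, at the cost of requiring a newer interpreter.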

Marius
  • Hi Marius - just to confirm - you are saying that my code is fine and that you would do something very similar and not worry at all about doing match twice? Thanks, – Pavel Chernikov Feb 04 '15 at 01:22
  • Yeah, basically. You probably shouldn't use `Exception` though, raise something more specific like `ValueError` – Marius Feb 04 '15 at 01:23