OK, so I got this peculiar task :)

Assume we have a string of characters (a word) and it needs to be translated into another string of characters.

In its simplest form this could be solved by using string.maketrans and string.translate.
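For the single-character rules only, a minimal sketch might look like the following. It assumes Python 3, where the helper is spelled `str.maketrans` (`string.maketrans` is the Python 2 name), and it uses the single-character rules listed below; the variable names are purely illustrative. A dict passed to `str.maketrans` must have keys of length one, so this cannot express `ai` or `oi`.

```python
# Single-character rules only. Keys given to str.maketrans as a dict must be
# exactly one character long; the values may be longer strings.
single_char_rules = {"w": "o", "y": "u", "8": "th"}
table = str.maketrans(single_char_rules)

translated = "wy8".translate(table)
print(translated)  # outh
```

Characters without a rule are left unchanged by `translate`, which matches the "may stay intact" requirement.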

However, in my case a combination of two characters in the input may map to a different combination or to a single character, a single character may map to a combination of two characters, and a single character may map to another single character, e.g.

  ai -> should become e
  oi -> should become i

on the other hand

  8 -> should become th

but

  w -> should become o  
  y -> should become u  

other characters may stay intact e.g.

  a should remain a
  o should remain o   

So for the following input

aiakotoiwpy

the expected output would be

eakotiopu

One approach I am thinking of is to use a hash table (for the translations), read the input string character by character, and perform the replacements. I am wondering if there is a 'smarter' approach?
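The hash-table idea could be sketched roughly as below. This is only one possible reading of it: the function name `translate_word` and the "try the longest key first" policy are assumptions, not something fixed by the task description, and the rule table is assumed to be non-empty.

```python
def translate_word(word, rules):
    """Scan left to right, preferring the longest key that matches."""
    max_len = max(map(len, rules))  # assumes at least one rule
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so that 'ai' wins over 'a'.
        for size in range(max_len, 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += len(chunk)
                break
        else:
            # No rule applies: keep the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


rules = {"ai": "e", "oi": "i", "8": "th", "w": "o", "y": "u"}
print(translate_word("aiakotoiwpy", rules))  # eakotiopu
```

Characters that stay intact (such as `a` and `o`) need no entry in the table, because the fallback branch copies them through.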

Any valuable input will be highly appreciated!

Thanks.

EDIT

Tried this (as was suggested):

import re

d = {
    'ai': 'e',
    'ei': 'i',
    'oi': 'i',
    'o': 'o',
    'a': 'a',
    'w': 'o',
    'y': 'u',
}
s = "aiakotoiwpy"
pattern = re.compile('|'.join(d.keys()))
result = pattern.sub(lambda x: d[x.group()], s)

but the result is aiakotiopu, which is not what was expected...

PKey

1 Answer

The | (alternation) operator simply tries its alternatives from left to right and takes the first one that matches. So, if we move the two-character keys to the left of the one-character keys in the alternation, things should work better. We can do that by sorting in reverse with len() as our key function:

import re

d = {
    'ai': 'e',
    'ei': 'i',
    'oi': 'i',
    'o': 'o',
    'a': 'a',
    'w': 'o',
    'y': 'u',
}

s = "aiakotoiwpy"
pattern = re.compile('|'.join(sorted(d, key=len, reverse=True)))
result = pattern.sub(lambda x: d[x.group()], s)

print(result)

OUTPUT

eakotiopu
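The left-to-right behaviour can be checked directly, and the same idea can be wrapped in a small helper. The sketch below is illustrative only: `build_pattern` is a hypothetical name, the `'.'` rule is invented to show why `re.escape` is worth adding when keys may contain regex metacharacters, and the `'8'` rule is taken from the question.

```python
import re

# Alternation tries branches left to right and stops at the first match.
print(re.match("a|ai", "ai").group())  # a
print(re.match("ai|a", "ai").group())  # ai


def build_pattern(rules):
    # Longest keys first; re.escape guards keys that are regex metacharacters.
    keys = sorted(rules, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in keys))


# The '.' entry is a made-up rule, included only to exercise re.escape.
rules = {"a": "a", "ai": "e", "oi": "i", "8": "th", "w": "o", "y": "u", ".": "!"}
pattern = build_pattern(rules)
result = pattern.sub(lambda m: rules[m.group()], "aiakotoiwpy8.")
print(result)  # eakotioputh!
```

Without `re.escape`, the `'.'` key would match any character and the lookup in `rules` would fail for unmatched keys.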
cdlane
  • You can't talk about `|` being greedy or not as the term "greedy" applies to regex quantifiers, while `|` is an alternation operator. The first alternative matched makes the regex engine skip all others in Python `re`, which is a common behavior in all NFA regexes. – Wiktor Stribiżew Oct 05 '16 at 07:39
  • Not sure what you mean by *in theory*: `(?p)` in [PyPi `regex`](https://pypi.python.org/pypi/regex) does exactly that in practice. – Wiktor Stribiżew Oct 05 '16 at 07:47
  • `|` cannot be greedy because it is a quantifier. My comment above is a purely terminological remark, no need to argue about it. Regex is not an easy thing, and sticking with the widely accepted terminology is reasonable in order not to drown in this topic. – Wiktor Stribiżew Oct 05 '16 at 08:01
  • Thanks a lot, both approaches, yours (cdlane) and @WiktorStribiżew's (with OrderedDict), seem to have worked :) – PKey Oct 05 '16 at 08:05
  • I meant NOT a quantifier. I am a busy person, and I make errors when I am typing. – Wiktor Stribiżew Oct 05 '16 at 08:13