I need to replace German Umlauts (Ä, ä, Ö, ö, Ü, ü, ß) with their two-letter equivalents (Ae, ae, Oe, oe, Ue, ue, ss).

Currently, I have this function, but the string's length changes:

import re

def _translate_umlauts(s):
    """Translate a string into ASCII.

    This Umlaut translation comes from http://stackoverflow.com/a/2400577/152439
    """
    trans = {"\xe4" : "ae"}   # and more ...
    patt = re.compile("|".join(trans.keys()))
    return patt.sub(lambda x: trans[x.group()], s)

However, I have the requirement that the string's total length should not change. For example, Mär should become Mae.

Any help in deriving the appropriate solution (regex?) is greatly appreciated :)

andreas-h
  • Well, you can regex-match `Ä.` and replace it with `Ae`... but that won't work if the last character is `Ä`, and indiscriminately eating the following character is a pretty odd thing to do, isn't it? – Sneftel Oct 11 '13 at 09:42
  • The string length should not change? What a stupid requirement is this? –  Oct 11 '13 at 10:00
  • This is odd. After doing the replacement, how do you tell the difference between "löten" and "lösen", which both result in "loeen"? – Ber Oct 11 '13 at 10:05

2 Answers

... the string's total length should not change.

Well, that's an odd requirement, but you can have the pattern also consume the character that follows the umlaut:

patt = re.compile("([" + "".join(trans.keys()) + "]).")

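Note that the pattern now matches two characters, so the replacement has to look up only the captured umlaut, i.e. `trans[x.group(1)]` instead of `trans[x.group()]`. An umlaut that is the last character in the string will not be replaced: there is no following character to consume, so replacing it would change the string length.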
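
For illustration, here is a minimal sketch of how this pattern might be wired into a complete function. It assumes Python 3 and spells out the full mapping from the question; the names `TRANS`, `PATT` and `translate_keep_length` are made up for this example and are not from the original code.

```python
import re

# Mapping from the question (the original snippet only showed one entry).
TRANS = {
    "Ä": "Ae", "ä": "ae",
    "Ö": "Oe", "ö": "oe",
    "Ü": "Ue", "ü": "ue",
    "ß": "ss",
}

# An umlaut (captured as group 1) followed by any one character.
# The following character is part of the match and is therefore dropped.
PATT = re.compile("([" + "".join(TRANS) + "]).")

def translate_keep_length(s):
    # Look up group 1 only: the full match also contains the swallowed character.
    return PATT.sub(lambda m: TRANS[m.group(1)], s)

print(translate_keep_length("Mär"))    # Mae
print(translate_keep_length("löten"))  # loeen
print(translate_keep_length("Bä"))     # Bä (trailing umlaut is left untouched)
```

Since `.` does not match a newline by default, an umlaut directly before a line break would also be left as-is in this sketch.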

Tomalak

Just truncate back to the original string length:

return patt.sub(lambda x: trans[x.group()], s)[:len(s)]
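
A self-contained sketch of the truncation idea, assuming Python 3 and the full mapping from the question; `TRANS`, `PATT` and `translate_truncate` are hypothetical names chosen for this example.

```python
import re

# Mapping from the question (the original snippet only showed one entry).
TRANS = {
    "Ä": "Ae", "ä": "ae",
    "Ö": "Oe", "ö": "oe",
    "Ü": "Ue", "ü": "ue",
    "ß": "ss",
}

# Same alternation pattern as in the question: match any single umlaut.
PATT = re.compile("|".join(TRANS))

def translate_truncate(s):
    # Expand every umlaut first, then cut back to the original length.
    expanded = PATT.sub(lambda m: TRANS[m.group()], s)
    return expanded[:len(s)]

print(translate_truncate("Mär"))    # Mae
print(translate_truncate("löten"))  # loete
print(translate_truncate("Bä"))     # Ba
```

With this variant the extra characters are cut from the end of the string rather than after each umlaut, so a trailing umlaut loses its second letter.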
Mark Tolonen