How to replace unicode characters by ascii characters in Python (perl script given)?

Question

I am trying to learn Python and couldn't figure out how to translate the following Perl script to Python:

#!/usr/bin/perl -w                     

use open qw(:std :utf8);

while(<>) {
  s/\x{00E4}/ae/;
  s/\x{00F6}/oe/;
  s/\x{00FC}/ue/;
  print;
}

The script just changes Unicode umlauts to alternative ASCII output. (So the complete output is in ASCII.) I would be grateful for any hints. Thanks!

http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database/816319#816319 — , Apr 23 '10 at 19:30
The given Perl script will actually only substitute the first occurrence on each line, but that's surely an accident. — tripleee, Dec 15 '13 at 16:52

score 47 · Answer 1 · answered Apr 23 '10 at 20:50

For converting to ASCII you might want to try ASCII, Dammit or this recipe, which boils down to:

>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
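On Python 3 the same recipe needs one extra step, because encode returns bytes rather than str. The following is a minimal sketch, not part of the original answer; the function name strip_to_ascii is made up for illustration. Note that ß has no NFKD decomposition, so 'ignore' drops it entirely, which is why the result ends in "groe".

```python
import unicodedata

def strip_to_ascii(text):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode() returns bytes on Python 3, so decode back to str
    return decomposed.encode("ascii", "ignore").decode("ascii")

title = "Klüft skräms inför på fédéral électoral große"
print(strip_to_ascii(title))
# Kluft skrams infor pa federal electoral groe
```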

Ian Bicking

which does not at all do what the original .pl does (mainly, properly transliterating German special characters) – Apr 23 '10 at 23:33
stripping the dots from German umlauts makes just about as much sense as stripping one leg from "x" and writing "y" or replacing "d" with "b" because they "kinda look the same". – Dec 04 '14 at 16:21
No, you might get collisions because you map different strings to the same one. – Radio Controlled Mar 08 '20 at 09:17

score 18 · Accepted Answer · edited May 04 '16 at 14:15

Use the fileinput module to loop over standard input or a list of files,
decode the lines you read from UTF-8 to unicode objects,
then map any Unicode characters you desire with the translate method.

translit.py would look like this:

#!/usr/bin/env python2.6
# -*- coding: utf-8 -*-

import fileinput

table = {
          0xe4: u'ae',        # keys are code points; 0xe4 is ä
          ord(u'ö'): u'oe',
          ord(u'ü'): u'ue',
          ord(u'ß'): None,    # None deletes the character; use u'ss' to transliterate it
        }

for line in fileinput.input():
    s = line.decode('utf8')
    print s.translate(table),

And you could use it like this:

$ cat utf8.txt 
sömé täßt
sömé täßt
sömé täßt

$ ./translit.py utf8.txt 
soemé taet
soemé taet
soemé taet

Update:

If you are using Python 3, strings are Unicode by default, so a string containing non-ASCII or even non-Latin characters needs no extra encoding or decoding step. The solution then looks as follows:

line = 'Verhältnismäßigkeit, Möglichkeit'

table = {
         ord('ä'): 'ae',
         ord('ö'): 'oe',
         ord('ü'): 'ue',
         ord('ß'): 'ss',
       }

print(line.translate(table))
# Verhaeltnismaessigkeit, Moeglichkeit
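To get something closer to the original Perl filter on Python 3, the table can be applied line by line. This is a sketch under assumptions: the helper name transliterate_lines and the uppercase entries are additions for illustration, not part of the answer above.

```python
import io

# str.maketrans accepts a dict whose keys are single characters
GERMAN_TABLE = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",  # uppercase forms: an assumption
})

def transliterate_lines(lines):
    """Yield each line with German umlauts and ß replaced everywhere."""
    for line in lines:
        yield line.translate(GERMAN_TABLE)

# io.StringIO stands in for a file or sys.stdin opened as UTF-8 text
sample = io.StringIO("sömé täßt\nÜbung\n")
print("".join(transliterate_lines(sample)), end="")
# soemé taesst
# Uebung
```

Unlike the Perl script, translate replaces every occurrence on a line. Characters outside the table, such as é, pass through unchanged, so the output is only pure ASCII if the input contains nothing else.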

edited May 04 '16 at 14:15

Dhia

answered Apr 23 '10 at 19:23

And to get ascii output the last line should be `print s.translate(table).encode('ascii', 'ignore')`, I guess. – Frank Apr 23 '10 at 20:00
strictly speaking the original .pl doesn't do that either, but yes, that would be one solution – Apr 23 '10 at 23:31
The objective appears to be de-umlauting German text, leaving it understandable. The effect of `ord(u'ß'): None` in this code is to **delete** the ß ("eszett") character. It should be `ord(u'ß'): u'ss'`. Upvotes?? Accepted answer??? – John Machin Apr 23 '10 at 23:50
oh. come. on. i tried to show the different possibilities for the map. – Apr 24 '10 at 02:02
You chose a very bad example of how to do something that the OP didn't indicate that he wanted or needed. – John Machin Apr 24 '10 at 02:15
@john: if you would take the OP's question literally together with his comment above ('ignore'), it would have the _exact_ _same_ outcome, so stop nitpicking already. – Apr 24 '10 at 06:13
Is there no library that can do this for us? – PascalVKooten Mar 24 '17 at 14:03
@PascalvKooten: there are several, search for "unicode" on pypi. try `unidecode` for example. it has the same problem of stripping `ö` down to `o`, etc. – Mar 24 '17 at 16:03

score 6 · Answer 3 · answered May 06 '14 at 19:06

You could try unidecode to convert Unicode into ASCII instead of writing manual regular expressions. It is a Python port of the Text::Unidecode Perl module:

#!/usr/bin/env python
import fileinput
import locale
from contextlib import closing
from unidecode import unidecode # $ pip install unidecode

def toascii(files=None, encoding=None, bufsize=-1):
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    with closing(fileinput.FileInput(files=files, bufsize=bufsize)) as file:
        for line in file: 
            print unidecode(line.decode(encoding)),

if __name__ == "__main__":
    import sys
    toascii(encoding=sys.argv.pop(1) if len(sys.argv) > 1 else None)

It uses the FileInput class instead of the module-level fileinput.input() to avoid global state.

Example:

$ echo 'äöüß' | python toascii.py utf-8
aouss

score 3 · Answer 4 · answered Dec 15 '13 at 16:42

I use translitcodec

>>> import translitcodec
>>> print '\xe4'.decode('latin-1')
ä
>>> print '\xe4'.decode('latin-1').encode('translit/long').encode('ascii')
ae
>>> print '\xe4'.decode('latin-1').encode('translit/short').encode('ascii')
a

You can change the encoding passed to decode to whatever your input uses. You may want a simple helper function so that the chain is written only once:

def fancy2ascii(s):
    return s.decode('latin-1').encode('translit/long').encode('ascii')

Radio Controlled · Answer 5 · 2020-03-08T09:37:14.433

-3

Quick and dirty (python2):

def make_ascii(string):
    return string.decode('utf-8').replace(u'ü', 'ue').replace(u'ö', 'oe').replace(u'ä', 'ae').replace(u'ß', 'ss').encode('ascii', 'ignore')
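On Python 3 the input is already a str, so no decode is needed. A rough equivalent follows, assuming you also want to see which characters 'ignore' would silently drop; the names make_ascii_py3 and dropped are illustrative only.

```python
def make_ascii_py3(text):
    """Replace German special characters, then drop any remaining non-ASCII."""
    for src, dst in (("ü", "ue"), ("ö", "oe"), ("ä", "ae"), ("ß", "ss")):
        text = text.replace(src, dst)
    # Collect what is about to be discarded so the loss is visible
    dropped = sorted({ch for ch in text if ord(ch) > 127})
    ascii_text = text.encode("ascii", "ignore").decode("ascii")
    return ascii_text, dropped

print(make_ascii_py3("Möglichkeit, fédéral"))
# ('Moeglichkeit, fdral', ['é'])
```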

edited Mar 08 '20 at 09:37

answered Mar 08 '20 at 09:28

Radio Controlled
