Replace accented character with html entity

Question

I'm trying to automate a series of queries, but I need to replace accented characters with the corresponding html entity. It needs to be in Python 3.

Example:

vèlit 
[needs to become] 
v&egrave;lit

The thing is, whenever I try to do a word.replace, it doesn't find it.

This:

if u'è' in sentence:
    print(u'Found è')

Works and finds "è", but doing:

word.replace('è','&egrave;')

Doesn't do anything.

Strings can't be modified. `replace` creates a new string with the replaced text and returns it. `word = word.replace('è','&egrave;')` may be what you are after. — tdelaney, May 10 '18 at 19:00
As an aside, you don't need the `u` string qualifier in python 3 - strings are already unicode. — tdelaney, May 10 '18 at 19:01
Thank you! Only been working with Python for a few months so some things still escape me. Thanks! :D — Jordi, May 10 '18 at 20:19

snakecharmerb · Answer 1 · 2019-03-02T12:34:04.013

You can use the str.translate method and the data in python's html package to convert characters to the equivalent html entity.

To do this, str.translate needs a dictionary that maps characters (technically the character's integer representation, or ordinal) to html entities.

html.entities.codepoint2name contains the required data, but the entity names are not wrapped in '&' and ';'. You can use a dict comprehension to create a table with the values you need.

Once the table has been created, call your string's translate method with the table as the argument and the result will be a new string in which any characters with an html entity equivalent will have been converted.

>>> import html.entities
>>> s = 'vèlit'

>>> # Create the translation table
>>> table = {k: '&{};'.format(v) for k, v in html.entities.codepoint2name.items()}

>>> s.translate(table)
'v&egrave;lit'

>>> 'Voilà'.translate(table)
'Voil&agrave;'

Be aware that accented latin characters may be represented by a combination of unicode code points: 'è' can be represented by the single code point - LATIN SMALL LETTER E WITH GRAVE - or two codepoints - LATIN SMALL LETTER E followed by COMBINING GRAVE ACCENT. In the latter case (known as the decomposed form), the translation will not work as expected.

To get around this, you can convert the two-codepoint decomposed form to the single codepoint composed form using the normalize function from the unicodedata module in Python's standard library.

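>>> import unicodedata
>>> decomposed = unicodedata.normalize('NFD', s)    # 'e' followed by COMBINING GRAVE ACCENT
>>> decomposed
'vèlit'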
>>> decomposed == s
False
>>> len(decomposed)    # decomposed is longer than composed
6
>>> decomposed.translate(table)
'vèlit'
>>> composed = unicodedata.normalize('NFC', decomposed)
>>> composed == s
True
>>> composed.translate(table)
'v&egrave;lit'

This is by far a better and more generic solution to this common problem, for usually when there's one accented character in the text, there are others too, and often ones you wouldn't think of at once. If the string to be translated already contains html code, add a condition to exclude < and > etc.: for k, v in html.entities.codepoint2name.items() if k > 0xa0 — user508402, Jan 24 '20 at 12:35
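
The filter suggested in the comment above might look like the sketch below. The name `markup_safe_table` is made up for illustration. Note that `k > 0xa0` also leaves U+00A0 (the non-breaking space) untranslated, along with `<`, `>`, `&` and `"`.

```python
import html.entities

# Only map code points above U+00A0, so existing markup characters
# ('<', '>', '&', '"') pass through unchanged.
markup_safe_table = {cp: '&{};'.format(name)
                     for cp, name in html.entities.codepoint2name.items()
                     if cp > 0xa0}

# For comparison: the unfiltered table also escapes the markup itself.
full_table = {cp: '&{};'.format(name)
              for cp, name in html.entities.codepoint2name.items()}

snippet = '<b>v\u00e8lit</b>'
print(snippet.translate(markup_safe_table))  # <b>v&egrave;lit</b>
print(snippet.translate(full_table))         # &lt;b&gt;v&egrave;lit&lt;/b&gt;
```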

score 4 · Answer 2 · answered May 04 '21 at 13:20

As an update to the answer provided by snakecharmerb, it may be helpful to know that Python 3.3 introduced html.entities.html5, which maps HTML5 named character references to the equivalent Unicode characters and covers more characters than codepoint2name.

For me, I needed that dictionary because codepoint2name didn't include ł.

So, the code to create the translation table is slightly changed to this:

table = {get_wide_ordinal(v): '&{}'.format(k) for k, v in html.entities.html5.items()}

where get_wide_ordinal I got from https://stackoverflow.com/a/7291240/1233830:

def get_wide_ordinal(char):
    if len(char) != 2:
        return ord(char)
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)

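because some of the values in the html5 lookup are two characters long. Note that in Python 3.3 and later these values are sequences of two separate code points rather than surrogate pairs, so a translation table keyed on single code points cannot match them.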

Note that the names in this table generally already end with a ; which is why it is omitted from the format string. Some legacy names (for example amp) are also listed without the trailing ;.
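
As an alternative sketch (not the code from this answer), the table can be built by skipping the entries that do not fit a one-character translation table. The name `build_html5_table` is made up for illustration. The sketch assumes you also want to leave ASCII alone, since html5 has names for plain punctuation such as `!` and `,`.

```python
import html
import html.entities


def build_html5_table():
    """Hypothetical helper: map single non-ASCII code points to '&name;'."""
    table = {}
    for name, chars in html.entities.html5.items():
        if not name.endswith(';'):
            continue  # legacy spelling without ';' (e.g. 'amp')
        if len(chars) != 1:
            continue  # value is a sequence of two code points
        if ord(chars) < 0x80:
            continue  # leave ASCII (punctuation, '<', '&', ...) untouched
        # Several names can map to the same character; keep the first seen.
        table.setdefault(ord(chars), '&' + name)
    return table


html5_table = build_html5_table()
print('\u0142'.translate(html5_table))      # &lstrok;
print('v\u00e8lit'.translate(html5_table))  # v&egrave;lit
```

When a character has more than one HTML5 name, which one ends up in the table depends on the iteration order of the dictionary, so pick names explicitly if that matters.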

score 2 · Answer 3 · answered May 10 '18 at 19:03

Replace word.replace('è','&egrave;') with word = word.replace('è','&egrave;') and print the result to check.

word.replace('è','&egrave;') does work, but it returns a new string and leaves the content of word itself unchanged.

Check str.replace()
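
A minimal sketch of the difference, using the example string from the question:

```python
word = 'v\u00e8lit'                     # 'vèlit'

word.replace('\u00e8', '&egrave;')      # result is discarded
print(word)                             # vèlit (unchanged)

unchanged = word

word = word.replace('\u00e8', '&egrave;')  # rebind the name to the new string
print(word)                             # v&egrave;lit
```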

Anthony
