I have a string that is encoded in UTF-8, and I am trying to display the text on a web page. Every attempt I've made to convert the special characters to XML character entities has failed. I know what I'm doing wrong, but I don't know how to make it right.

Edit: The original question showed the string below without the b prefix and left out the conversion with str(). The code below now includes that conversion step.

Here's the example string I'm working with, which has a horizontal ellipsis at the end:

>>> html = b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'
>>> html = str(html)

My problem is that UTF-8 characters are of variable length, so I can't just do something like this:

>>> import re
>>> re.sub(r'\\(x[a-f\d]{2})', r'&#\1;', html) # Don't do this!
'<p>Lorem ipsum dolor sit amet&#xe2;&#x80;&#xa6;</p>'

This gives three separate character references, one per byte. Each is valid on its own, but together they do not represent the ellipsis. In my case, I can simply do:

>>> re.sub(r'\\xe2\\x80\\xa6', '&hellip;', html)
'<p>Lorem ipsum dolor sit amet&hellip;</p>'

But this only covers one of many possible characters. I don't have the time, the patience, or any intention of writing substitutions for every character.

So, my question is this: how do I tell the byte-length of a character? Is there some byte mask I can use to tell if a byte is the first or last byte of a character? Any other method of determining the length, or a module that will do it for me, is welcome.

Matt McCarthy

1 Answer

The HTML is being received as UTF-8-encoded bytes. The bytes can be converted to a str by decoding them like this:

html = bytes_string.decode('utf-8')

or like this:

html = str(bytes_string, 'utf-8')

Calling str(bytes_string) without an encoding does not decode the bytes; it returns the repr of the bytes object, including the b prefix and the backslash escapes.
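To make the difference concrete, here is a small sketch using the sample bytes from the question. The names `as_repr` and `decoded` are only illustrative.

```python
bytes_string = b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'

# str() without an encoding returns the repr: a string that starts with b'
# and spells out each non-ASCII byte as a literal backslash escape.
as_repr = str(bytes_string)

# decode() interprets the bytes as UTF-8, so the three bytes
# e2 80 a6 become the single character U+2026 (horizontal ellipsis).
decoded = bytes_string.decode('utf-8')

print(as_repr)
print(len(bytes_string), len(decoded))
print(decoded.endswith('\u2026</p>'))
```

The decoded string is two characters shorter than the byte string because the three-byte sequence collapses into one character, so there is no need to work out byte lengths by hand.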

Once decoded, characters can be converted to the equivalent HTML entities using data from the html.entities module in the standard library, together with str.translate.

from html import entities

# If we don't want to convert HTML tags, don't include
# '<' and '>' in the translation table.
skip = {ord(x) for x in '<>'}
trans_table = {k: '&{};'.format(v)
               for k, v in entities.codepoint2name.items() if k not in skip}

# html is the decoded str from the step above.
translated = html.translate(trans_table)
print(translated)

Output

<p>Lorem ipsum dolor sit amet&hellip;</p>
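Putting the decoding and translation steps together, a minimal end-to-end sketch might look like the following. The function name `to_named_entities` and its `skip` parameter are hypothetical, introduced here only for illustration.

```python
from html import entities


def to_named_entities(data, skip='<>'):
    """Decode UTF-8 bytes, then replace characters that have a named HTML entity.

    Hypothetical helper: `skip` lists characters to leave untouched.
    """
    text = data.decode('utf-8')
    skipped = {ord(ch) for ch in skip}
    # Map each code point to its '&name;' form, except the skipped ones.
    table = {
        codepoint: '&{};'.format(name)
        for codepoint, name in entities.codepoint2name.items()
        if codepoint not in skipped
    }
    return text.translate(table)


print(to_named_entities(b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'))
```

Note that codepoint2name also contains `&` and `"`, so markup with attributes or existing entities would be altered unless those characters are added to `skip`. Characters that have no named entity pass through unchanged; if numeric references are acceptable for those, `text.encode('ascii', 'xmlcharrefreplace')` is one way to produce them.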

I discuss how the translation works in more depth in this answer.

snakecharmerb