I have a string that is encoded in UTF-8, and I am trying to display the text on a web page. Every attempt I've made to convert the special characters to XML character references has failed. I know what I'm doing wrong, but I don't know how to make it right.
Edit: The original question showed the string without the `b` prefix and paid no attention to the conversion with `str()`. The updated conversion process, which was previously omitted, is shown below.
Here's the example string I'm working with, which has a horizontal ellipsis at the end:
>>> html = b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'
>>> html = str(html)
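As an aside, in Python 3 `str()` on a bytes object does not decode it; it returns the printable representation, with the `b''` wrapper and the escape sequences kept as literal text. A quick sketch to illustrate (variable names here are my own):

```python
# str() on bytes gives the repr, not decoded text.
raw = b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'

as_str = str(raw)
print(as_str)                    # b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>'
print(as_str.startswith("b'"))   # True: the b'' wrapper is part of the string
print('\\xe2' in as_str)         # True: literal backslash-x-e-2, not a byte
```

That is why the regexes below operate on the literal text `\xe2` rather than on actual bytes.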
My problem is that UTF-8 characters are of variable length, so I can't just do something like this:
>>> import re
>>> re.sub(r'\\(x[a-f\d]{2})', r'&#\1;', html) # Don't do this!
'<p>Lorem ipsum dolor sit amet&#xe2;&#x80;&#xa6;</p>'
This gives three character references that are each valid on their own, but together they are the wrong encoding: each byte of a multi-byte UTF-8 sequence has been encoded as though it were a whole character. In my case, I can simply do:
>>> re.sub(r'\\xe2\\x80\\xa6', r'&#x2026;', html)
'<p>Lorem ipsum dolor sit amet&#x2026;</p>'
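Putting those steps together end to end (note that the `b''` wrapper produced by `str()` is still part of the result; `&#x2026;` is just one way of writing the character reference for the ellipsis):

```python
import re

# Reproduce the manual, single-character substitution on the str()-converted text.
html = str(b'<p>Lorem ipsum dolor sit amet\xe2\x80\xa6</p>')
fixed = re.sub(r'\\xe2\\x80\\xa6', r'&#x2026;', html)
print(fixed)  # b'<p>Lorem ipsum dolor sit amet&#x2026;</p>'
```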
But this only covers one of many possible characters. I obviously don't have the time, the patience, or any intention of writing a substitution for every character.
So, my question is this: how do I tell the byte-length of a character? Is there some byte mask I can use to tell if a byte is the first or last byte of a character? Any other method of determining the length, or a module that will do it for me, is welcome.
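For what it's worth, the length is recoverable from the first byte alone: in UTF-8, a lead byte's high bits encode the sequence length, and continuation bytes always match the pattern `10xxxxxx`. A sketch of those masks (the function name `utf8_char_len` is my own, not from any library):

```python
def utf8_char_len(first_byte: int) -> int:
    """Byte length of a UTF-8 sequence, judged from its first byte."""
    if first_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: plain ASCII
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    raise ValueError('0x%02x is a continuation byte, not a lead byte' % first_byte)

print(utf8_char_len(0xe2))                  # 3: the ellipsis \xe2\x80\xa6 is three bytes
print(0xa6 & 0b11000000 == 0b10000000)      # True: continuation bytes look like 10xxxxxx
```

That said, in Python the standard library already does this bookkeeping: `bytes.decode('utf-8')` groups the bytes into characters for you.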