
While decoding email subjects, I ran into this issue:

>>> s = '=?UTF-8?B?0LU=?=' 
>>> decode_subjects(s) 
'е' 
>>> decode_subjects(s).encode() 
b'\xd0\xb5' 
>>> 'e'.encode()  # 'e' in ascii letters 
b'e' 
>>> decode_subjects(s) == 'e' 
False

Note: decode_subjects() is built on decode_header and make_header from email.header.
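
For reference, here is a minimal sketch of how such a helper might be written with `decode_header` and `make_header`. The body of `decode_subjects` below is an assumption for illustration, not necessarily the exact function I use:

```python
from email.header import decode_header, make_header


def decode_subjects(raw_subject):
    """Decode an RFC 2047 encoded-word subject into a plain str.

    Assumed implementation: the real helper may differ in details.
    """
    # decode_header() splits the value into (bytes_or_str, charset) pairs;
    # make_header() reassembles them into a Header, and str() yields text.
    return str(make_header(decode_header(raw_subject)))


s = '=?UTF-8?B?0LU=?='
decoded = decode_subjects(s)
print(decoded.encode())   # the UTF-8 bytes of the decoded character
print(decoded == 'e')     # compares against the ASCII letter e
```

The base64 payload `0LU=` decodes to the two bytes `\xd0\xb5`, which is why the result cannot be the one-byte ASCII letter.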

The decoded value of s = '=?UTF-8?B?0LU=?=' looks identical to the ASCII letter e, but the two are different characters.

Are there other characters like this? For example, b'\xSomething'.decode() giving a character that looks like 'a', and so on.

How can I find out, in code, which ASCII character such a character looks like?
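
To make the question concrete, here is a rough sketch of the kind of check I have in mind. `unicodedata.name` is a real standard-library call; the `HOMOGLYPHS` table and the helpers `describe` and `to_ascii_lookalike` are hypothetical names invented for this example, and the table covers only a handful of Cyrillic letters:

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (Cyrillic -> ASCII).
# A real table would need many more entries from other scripts.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def to_ascii_lookalike(text):
    """Replace characters found in HOMOGLYPHS with their ASCII lookalike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


ch = b"\xd0\xb5".decode()          # the character from the subject above
print(describe(ch))                # shows it is not LATIN SMALL LETTER E
print(describe("e"))
print(to_ascii_lookalike(ch) == "e")
```

Printing the Unicode name at least reveals which script a character belongs to, but mapping it back to the ASCII letter it resembles seems to need a lookup table like the one above, since the two are unrelated code points.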

Lê Tư Thành
    These are called "homoglyphs". Check out [Is there a list of characters that look similar to English letters?](https://stackoverflow.com/q/9491890/4518341), [Homoglyph attack detection in email phishing](https://stackoverflow.com/q/22448369/4518341), and [this Python library](https://pypi.org/project/homoglyphs/) (never used it myself to be clear). – wjandrea May 19 '20 at 02:21
    @wjandrea, it's very helpful. Thanks. – Lê Tư Thành May 21 '20 at 01:26
