How to detect confusing decoding?

Asked May 19 '20 at 01:46

Active May 19 '20 at 02:00

Viewed 165 times

While decoding email subjects, I ran into this issue:

>>> s = '=?UTF-8?B?0LU=?=' 
>>> decode_subjects(s) 
'е' 
>>> decode_subjects(s).encode() 
b'\xd0\xb5' 
>>> 'e'.encode()  # 'e' in ascii letters 
b'e' 
>>> decode_subjects(s) == 'e' 
False

Note: decode_subjects() is built on decode_header and make_header from email.header.
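For context, here is a minimal sketch of what decode_subjects presumably looks like. The function body is an assumption, since the original helper is not shown; only decode_header and make_header come from the note above.

```python
from email.header import decode_header, make_header


def decode_subjects(raw: str) -> str:
    """Decode an RFC 2047 encoded header value into a Unicode string.

    Hypothetical reconstruction; the real helper may differ.
    """
    # decode_header splits the value into (bytes, charset) parts;
    # make_header joins them back into a Header that str() renders as text.
    return str(make_header(decode_header(raw)))


decoded = decode_subjects('=?UTF-8?B?0LU=?=')
print(decoded == 'e')  # False: the decoded character is not ASCII 'e'
```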

s = '=?UTF-8?B?0LU=?=' decodes to a character that looks identical to the ASCII letter e, but the two are different characters.
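The difference shows up when the code points are inspected rather than the glyphs. Below is a minimal sketch using only the standard unicodedata module; the describe helper is a hypothetical name introduced here for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


print(describe('\u0435'))  # U+0435 CYRILLIC SMALL LETTER IE
print(describe('e'))       # U+0065 LATIN SMALL LETTER E
print('\u0435'.isascii())  # False
```

A quick str.isascii() check like the last line can flag decoded subjects that contain any non-ASCII character, though it does not say which ASCII letter the character resembles.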

Are there other characters like this? For example, b'\xSomething'.decode() giving a character that looks like 'a', and so on.

How can I find out, in code, which ASCII character such a decoded character resembles?

edited May 19 '20 at 02:00

asked May 19 '20 at 01:46

Lê Tư Thành


These are called "homoglyphs". Check out [Is there a list of characters that look similar to English letters?](https://stackoverflow.com/q/9491890/4518341), [Homoglyph attack detection in email phishing](https://stackoverflow.com/q/22448369/4518341), and [this Python library](https://pypi.org/project/homoglyphs/) (never used it myself to be clear). – wjandrea May 19 '20 at 02:21

@wjandrea, it's very helpful. Thanks. – Lê Tư Thành May 21 '20 at 01:26


0 Answers