How to convert messed up string to plain text

Question

I am using Python 3 to receive and process text messages from a Telegram channel. I sometimes get messages containing a string like this:

Ехchanges: Віnance Futures

It looks pretty normal. But when I check

if 'Exchanges' in the_string:

I get

False

Trying to track this down:

the_string.encode()

yields

b'\xd0\x95\xd1\x85changes: \xd0\x92\xd1\x96nance Futures'

How can I convert this to a usual string?

'Exchanges: Binance Futures'

In your example, it looks like the first character is `U+0415 Cyrillic Capital Letter Ie`. It looks identical to the ASCII character `E`, but the visual similarity is a red herring, and you shouldn't expect Python to treat the characters as equal just because they look the same. – water_ghosts, Mar 20 '21 at 21:23
Does this answer your question? [Translate Unicode to ascii (if possible)](https://stackoverflow.com/questions/43367355/translate-unicode-to-ascii-if-possible) or [Where is Python's “best ASCII for this Unicode” database?](https://stackoverflow.com/q/816285/4518341) – wjandrea, Mar 20 '21 at 21:34
@water_ghosts This makes sense. I will use the non-Russian string for the if condition, then. You can add this as an answer, and I will mark it as solved. – Tom Atix, Mar 20 '21 at 21:37
BTW, instead of using encoding for that analysis, you could use `ascii()`, which shows characters instead of bytes: `print(ascii(the_string))` -> `'\u0415\u0445changes: \u0412\u0456nance Futures'` – wjandrea, Mar 20 '21 at 21:42
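
To make the diagnosis in the comments above concrete, here is a minimal sketch that lists every non-ASCII character in the string together with its code point and Unicode name. The helper name `describe_non_ascii` is made up for illustration; `unicodedata.name` is from the standard library.

```python
import unicodedata

# The string from the question, written with escapes so the
# look-alike characters are visible in the source.
the_string = "\u0415\u0445changes: \u0412\u0456nance Futures"


def describe_non_ascii(text):
    """Return (index, character, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


for entry in describe_non_ascii(the_string):
    print(entry)
```

For this string it reports four Cyrillic letters (at positions 0, 1, 11 and 12), which is why `'Exchanges' in the_string` is `False`.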

score -1 · Answer 1 · answered Mar 20 '21 at 21:16

Try combining the encode() and decode() methods of the str class:

>>> my_string = 'Ехchanges: Віnance Futures'
>>> 'Ехchanges' in my_string
True
>>> my_string.encode()
b'\xd0\x95\xd1\x85changes: \xd0\x92\xd1\x96nance Futures'
>>> 'Ехchanges' in my_string.encode().decode()
True

Funpy97

Doesn't work. The original string is Ехchanges: Віnance Futures; I just typed it normally in the example above. The bytes representation is the correct one, though. If I encode and then decode, I get a correct-looking string, but the if condition is still False. – Tom Atix Mar 20 '21 at 21:32
`'Ехchanges' in my_string` -> `True`??? You missed the whole point of the question. – wjandrea Mar 20 '21 at 21:43

score -1 · Answer 2 · answered Mar 20 '21 at 21:25

It's a UTF-8 encoded string. You need to decode it with decode('utf-8').

Solution:

encoded_string = b'\xd0\x95\xd1\x85changes: \xd0\x92\xd1\x96nance Futures'
decoded_string = encoded_string.decode("utf-8")
print(decoded_string)

Gaurang Delvadiya

Doesn't do what it should... The string looks correct, yes. But the if condition is still False. – Tom Atix Mar 20 '21 at 21:33
`the_string` is a string. OP only tried encoding it to see what the underlying characters are. – wjandrea Mar 20 '21 at 21:37
Also, technically, you're talking about `bytes` objects, not `str`. – wjandrea Mar 20 '21 at 21:44
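
As the comments point out, an encode/decode round trip leaves the characters unchanged. One possible way to get the plain string the question asks for is to map the look-alike characters explicitly with `str.maketrans` and `str.translate`. This is only a sketch: the table `HOMOGLYPHS` and the helper `to_plain` are hypothetical names, and the table covers only the four Cyrillic letters seen in this example, so real messages may need more entries.

```python
# Map the Cyrillic look-alikes found in the example to their ASCII twins.
# Assumption: only these four characters occur; extend the table as needed.
HOMOGLYPHS = str.maketrans({
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
})


def to_plain(text):
    """Replace known look-alike characters; everything else is left untouched."""
    return text.translate(HOMOGLYPHS)


the_string = "\u0415\u0445changes: \u0412\u0456nance Futures"
plain = to_plain(the_string)
print(plain)                 # Exchanges: Binance Futures
print("Exchanges" in plain)  # True
```

An explicit table is used here because Unicode normalization (for example NFKD) does not turn Cyrillic letters into Latin ones; they are distinct letters, not compatibility variants.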
