Python: encoding issues - comparing two strings with different encoding

Question

I have some problems with encoding in Python. I've searched for an answer for a couple of hours now and still had no luck.

I am currently working in a Jupyter notebook with pandas dataframes. Long story short: in a dataframe column I have different strings, each a single letter of the alphabet. I wanted to apply a function to this column that converts letters to numbers based on a specific key, but I got an error every time I tried. When I dug into the reason behind this, I realised that:

I have two strings 'T'. But they are not equal.

string1.encode() = b'T'  
string2.encode() = b'\xd0\xa2'

How can I standardize/encode/decode/modify all strings to have the same coding/basis so I can compare them and make operations on them? What is the easiest way to achieve that?
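A quick way to check whether two strings that look the same really contain the same characters is to print each character's code point and Unicode name. The sketch below assumes the two values from above; `describe` is a hypothetical helper name, not something from the original notebook.

```python
import unicodedata

string1 = "T"        # Latin capital letter T
string2 = "\u0422"   # Cyrillic capital letter Te, renders almost identically


def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), ch.encode("utf-8"))
        for ch in text
    ]


for s in (string1, string2):
    print(describe(s))
# [('U+0054', 'LATIN CAPITAL LETTER T', b'T')]
# [('U+0422', 'CYRILLIC CAPITAL LETTER TE', b'\xd0\xa2')]
```

If the names differ, the strings hold different characters, and re-encoding or decoding will not make them equal.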

The first `T` is the Latin letter with code point `0x54`; the second `Т` is the Cyrillic letter with code point `0x422`. Those are different letters, so the only way I see is to build a custom mapping that matches letters which look similar. — Olvin Roght, Jan 28 '22 at 21:56
Does your dataframe come from a file such as a .csv? If so, are you specifying the encoding when you read it? — juuso, Jan 28 '22 at 21:56
It's a dataframe from a .csv file. I tried enforcing utf-8 encoding when reading the data, but it didn't change anything. This is data scraped from a wiki site, so it could contain some "shady" strings. — dtBane, Jan 28 '22 at 22:00
@dtBane, take a look at [Is there a list of characters that look similar to English letters?](https://stackoverflow.com/q/9491890/10824407). — Olvin Roght, Jan 28 '22 at 22:02
Thanks, it actually helped a lot! This was indeed just a different letter, and I had to change it manually. The more you know :) — dtBane, Jan 30 '22 at 10:48
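For anyone who wants to automate the manual fix, the custom mapping suggested in the comments can be sketched with `str.maketrans` and `str.translate`. The table below is a hypothetical, deliberately incomplete list of Cyrillic capitals that resemble Latin ones; which pairs count as "the same" is an assumption you would have to adjust for your own data. Note that `unicodedata.normalize` does not help here, because Cyrillic and Latin letters are distinct characters rather than different forms of one character.

```python
# Hypothetical, incomplete lookalike table: Cyrillic capital -> Latin capital.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041A": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u041C": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u041D": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u041E": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
})


def to_latin(text):
    """Replace known Cyrillic lookalikes with their Latin counterparts."""
    return text.translate(CYRILLIC_TO_LATIN)


letters = ["T", "\u0422", "A", "\u0410"]
cleaned = [to_latin(s) for s in letters]
print(cleaned)  # ['T', 'T', 'A', 'A']
```

On a pandas column this could be applied with something like `df["letter"] = df["letter"].map(to_latin)`, where `letter` is a placeholder column name. Characters missing from the table pass through unchanged, so it is worth re-checking the column for leftover non-ASCII values afterwards.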
