
I have a problem with string encoding in Python. I've searched for an answer for a couple of hours now, with no luck.

I am working in a Jupyter notebook with pandas dataframes. Long story short: one dataframe column contains strings that are single letters of the alphabet. I wanted to apply a function to this column that converts each letter to a number based on a specific key, but I got an error every time I tried. When I dug into the reason, I realised the following:

I have two strings that both display as 'T', but they are not equal:

string1.encode() = b'T'  
string2.encode() = b'\xd0\xa2'
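
To see why the two values differ, it can help to look at the code point and Unicode name of each character rather than at the glyph. The snippet below is a minimal sketch that assumes `string2` holds U+0422, which is what the bytes `b'\xd0\xa2'` decode to in UTF-8; the variable names simply mirror the ones above.

```python
import unicodedata

string1 = "T"        # Latin capital letter T, U+0054
string2 = "\u0422"   # assumed: Cyrillic capital letter Te, U+0422 (renders like "T")

for s in (string1, string2):
    # Show the code point, the official Unicode name, and the UTF-8 bytes
    print(repr(s), hex(ord(s)), unicodedata.name(s), s.encode("utf-8"))

# The glyphs look alike, but the strings hold different characters
print(string1 == string2)
```

Both strings are valid UTF-8, so re-reading the file with a different `encoding` argument would not be expected to change the comparison.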

How can I standardize all the strings (by encoding, decoding, or otherwise modifying them) so that they share the same representation and I can compare and operate on them? What is the easiest way to achieve that?

bad_coder
dtBane
  • The first `T` is a Latin letter with char code `0x54`; the second `Т` is a Cyrillic letter with char code `0x422`. Those are different letters. The only way I see is to build a custom mapping that matches letters which look similar. – Olvin Roght Jan 28 '22 at 21:56
  • Does your dataframe come from a file such as a .csv? If so, are you specifying the encoding when you read it? – juuso Jan 28 '22 at 21:56
  • It's a dataframe from a .csv file. I tried enforcing utf-8 encoding when reading the data, but it didn't change anything. This is data scraped from a wiki site, so it could contain some "shady" strings. – dtBane Jan 28 '22 at 22:00
  • @dtBane, take a look at [Is there a list of characters that look similar to English letters?](https://stackoverflow.com/q/9491890/10824407). – Olvin Roght Jan 28 '22 at 22:02
  • Thanks, it actually helped a lot! This was indeed just a different letter, and I had to change it manually. The more you know :) – dtBane Jan 30 '22 at 10:48
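
The custom mapping suggested in the comments could look roughly like the sketch below. The `HOMOGLYPHS` table, `to_latin`, and `find_non_ascii` are hypothetical names introduced here for illustration, and the table is deliberately incomplete: it covers only a handful of Cyrillic capitals that resemble Latin ones. Unicode normalization (`unicodedata.normalize`) is not used because these are distinct letters rather than alternative forms of the same one, so normalization leaves them unchanged.

```python
# Hypothetical, incomplete table: Cyrillic capitals -> Latin look-alikes
HOMOGLYPHS = {
    "\u0410": "A",  # Cyrillic A
    "\u0412": "B",  # Cyrillic Ve
    "\u0415": "E",  # Cyrillic Ie
    "\u041a": "K",  # Cyrillic Ka
    "\u041c": "M",  # Cyrillic Em
    "\u041d": "H",  # Cyrillic En
    "\u041e": "O",  # Cyrillic O
    "\u0420": "P",  # Cyrillic Er
    "\u0421": "C",  # Cyrillic Es
    "\u0422": "T",  # Cyrillic Te
    "\u0425": "X",  # Cyrillic Ha
}
TABLE = str.maketrans(HOMOGLYPHS)


def to_latin(text: str) -> str:
    """Replace known Cyrillic look-alikes with their Latin counterparts."""
    return text.translate(TABLE)


def find_non_ascii(values):
    """Return the distinct non-ASCII characters found in an iterable of strings."""
    return sorted({ch for value in values for ch in value if ord(ch) > 127})


letters = ["T", "\u0422", "A"]          # stand-in for the dataframe column
print(find_non_ascii(letters))          # characters that need attention
print([to_latin(s) for s in letters])   # cleaned values
```

With pandas, the same table can be applied to a whole column through `Series.str.translate(TABLE)`. Running `find_non_ascii` on the column first shows which characters the table still needs to cover.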

0 Answers