
Let's say I have:

  1. A proprietary Python library that reads a file in 'Latin-1'. I can't change the way it's read.
  2. As a result, a dataFrame1 is generated, where one of the values is meant to be stored as "Column€", but I can see from the debugger that it's stored as 'Column\x80'.
  3. I need to match this text value to a dataFrame2 (e.g. use "Column€" as a key for joining some data), and that second data frame is originally encoded in 'utf-8', e.g. "Column€". I am not able to change the input encoding here either.

Basically, I want both data frames to store "Col€" so that I can use it as a unique key to join them.
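To make the mismatch concrete, here is a minimal sketch of how such a value can arise. It assumes the file is really Windows-1252 (where the euro sign is the single byte 0x80) and that the library decodes it as Latin-1, as described above. The variable names are made up for illustration.

```python
# Assumption: the source file is actually Windows-1252, where the euro sign is byte 0x80.
raw = "Column€".encode("cp1252")        # the bytes on disk: b'Column\x80'

# What the proprietary library is described as doing: decoding those bytes as Latin-1.
# Latin-1 maps byte 0x80 to U+0080, an invisible control character.
from_library = raw.decode("latin-1")    # 'Column\x80'

# The second data frame comes from UTF-8 input, so it holds the real euro sign (U+20AC).
from_utf8 = "Column€".encode("utf-8").decode("utf-8")

# The two keys differ by one code point, so a join on them finds no match.
keys_match = from_library == from_utf8
print(repr(from_library), repr(from_utf8), keys_match)
```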

I tried x.encode('utf-8') but it returns 'Col?'.

Decoding like x.decode('latin1').encode('utf-8') didn't work either (there are quite a lot of variations of it here on Stack Overflow).

My gut feeling is that I'm missing some fundamental knowledge about encodings. :) What else could I try?

Zhiroslav
  • It looks like the first df is actually Windows-1252, based on \x80 showing up for the euro symbol: https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html. But I guess that applies to the original data, wherever that is. I guess you could convert the euro symbol in df2 to '\x80'. I am very weak in this particular type of problem, though, so I'll admit right now that I'm pretty lost too. I assume that once the files are read they are in utf-8, whatever they were and even if they weren't read properly, so you have to capture the correct encoding when reading, not afterward. – topsail Jun 17 '22 at 19:45
  • See https://stackoverflow.com/q/11346283/5987 – Mark Ransom Jun 17 '22 at 19:54
  • Please edit your question to provide a minimal reproducible example. – JosefZ Jun 17 '22 at 19:59
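
Following the Windows-1252 hint in the comments, one thing that may be worth trying is to undo the Latin-1 decoding after the fact and decode the same bytes as Windows-1252 instead. This is only a sketch under that assumption, not a confirmed fix; the helper name `fix_latin1_as_cp1252` is hypothetical.

```python
def fix_latin1_as_cp1252(text: str) -> str:
    """Re-interpret text that was decoded as Latin-1 but was really Windows-1252.

    Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
    so encoding back to Latin-1 recovers the original bytes without loss.
    Those bytes can then be decoded with the encoding assumed to be correct.
    """
    return text.encode("latin-1").decode("cp1252")


fixed = fix_latin1_as_cp1252("Column\x80")
print(fixed)  # Column€

# Text without any bytes in the 0x80-0x9F range passes through unchanged.
unchanged = fix_latin1_as_cp1252("Column")
```

For a pandas column, the same round trip can be written with the string accessor, for example `df1["key"].str.encode("latin-1").str.decode("cp1252")`, where `df1` and `"key"` are placeholder names. Note that Python's cp1252 codec leaves a few bytes undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so the decode step raises `UnicodeDecodeError` if the data contains them.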
