
1) How do I convert a variable with a string like "wdzi\xc4\x99czno\xc5\x9bci" into "wdzięczności"?

2) Also, how do I convert a string variable containing characters like "±", "ę", "Ć" into the correct letters?

I emphasise "variable" because all I found by googling were examples using literals like u'some string', and I can't get anything like that to work.

I use "# -*- coding: utf-8 -*-" on the second line of my script and I still run into these problems.

Also, I was told that a simple print should output the text correctly, but it does not.

maxshuty
dyer
    Possible duplicate of [Process escape sequences in a string in Python](http://stackoverflow.com/questions/4020539/process-escape-sequences-in-a-string-in-python) – Gord Thompson Oct 14 '15 at 21:02

1 Answer


In Python 2.7 IDLE, I get this output:

>>> print "wdzi\xc4\x99czno\xc5\x9bci".decode('utf-8')
wdzięczności

Your first string appears to be a UTF-8 byte string, so all that's necessary is to decode it into a Unicode string. When Python prints that string, it will encode it back to the proper encoding based on your environment.

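If you're using Python 3, then you have a string that has been decoded improperly (its UTF-8 bytes were read as if each byte were an ISO-8859-1 character), and it needs a little more work to fix the damage: encode it back to bytes, then decode those bytes as UTF-8.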

>>> print("wdzi\xc4\x99czno\xc5\x9bci".encode('iso-8859-1').decode('utf-8'))
wdzięczności
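
A minimal Python 3 sketch of the two situations, using the same example string. The variable names are only illustrative; which case applies depends on whether you hold raw bytes or an already mis-decoded `str`.

```python
# Case 1: raw UTF-8 bytes (for example, read from a file opened in binary mode).
raw = b"wdzi\xc4\x99czno\xc5\x9bci"
text = raw.decode("utf-8")

# Case 2: a str whose code points are really UTF-8 byte values (mojibake).
# Encoding as Latin-1 maps each code point 0-255 back to the same byte value.
mangled = "wdzi\xc4\x99czno\xc5\x9bci"
repaired = mangled.encode("latin-1").decode("utf-8")

print(text)
print(repaired)
```

If the `encode("latin-1")` step raises `UnicodeEncodeError`, the string contains characters above U+00FF, so it was not produced by this particular kind of mis-decoding.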
Mark Ransom
  • This simple example did work, thanks. But I'm still getting errors like this in the script itself: `UnicodeEncodeError: 'latin-1' codec can't encode character '\u0119' in position 187: ordinal not in range(256)` Edit: I also just noticed that I now have "wdzi\xc4\x99czno\xc2\xb6ci" in the output, which with your decoding either prints "wdzięczno¶ci" or raises `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 88: invalid start byte`. I tried encoding in windows-1250 and still got errors. Any ideas? – dyer Oct 14 '15 at 21:00
  • @dyer if you're using Python 3, the strings you create within the program should be valid Unicode strings already that need no fixing. `'\u0119'` for example is [LATIN SMALL LETTER E WITH OGONEK `ę`](http://www.fileformat.info/info/unicode/char/0119/index.htm). My advice is *only* for strings that you received from outside the program that have been mangled; this is called mojibake. The best solution is to prevent that from happening in the first place, but your question doesn't have enough info to tackle that problem. – Mark Ransom Oct 14 '15 at 21:12
  • @dyer You should have some understanding of the difference between a byte string and a Unicode text string: see the [Unicode howto](https://docs.python.org/3/howto/unicode.html). If you end up with strings like `"±"` you might have opened a file using the wrong encoding, e.g. use `open(path, encoding='utf-8')` instead of `open(path)`. – roeland Oct 14 '15 at 21:51
  • @roeland But is there a way to convert that to what it should be? How do I encode/decode that? That's an improperly encoded character, right? I've been trying everything with a string like "Obowi±zki wdziêczno¶ci" and I can't get it to work. – dyer Oct 14 '15 at 22:43
  • @roeland Also (can't edit now) I'm using BeautifulSoup and I have `soup = bs4.BeautifulSoup(openfile.read(), "html.parser", from_encoding='utf-8')` there, but it still outputs the text like that, so I'm trying to convert that output. – dyer Oct 14 '15 at 22:49
  • @dyer you **need to know** what encoding was used on the data and how it was read into your program. Without that you're just throwing darts and hoping you hit a target. That's no way to write software. – Mark Ransom Oct 14 '15 at 22:49
  • @dyer The usual way in Python 3 is that the file object handles the decoding, not the parser object. You will get better help if you include a small *but complete* program in your question. – roeland Oct 15 '15 at 00:14
  • @roeland Well, I've found the source of the problem: Firefox automatically saves the htm file not in utf-8, but in the gibberish I've been dealing with. There is no option to choose the encoding as there is when saving a txt file (no utf8), so I have to copy the source and paste it into Notepad to save it properly by hand. Also, in my code I had to have `openfile = open(webpage, "r", encoding='utf-8')`: one more place where utf-8 has to be mentioned. Anyway, I find it weird that I apparently can't use Python to convert "Obowi±zki wdziêczno¶ci" if I really had to deal with a mess like that after the fact. – dyer Oct 15 '15 at 01:38
  • @dyer try using `encoding='mbcs'` and see if that helps. – Mark Ransom Oct 15 '15 at 01:58
  • @MarkRansom It didn't. But now I've noticed something unusual: it saves all the Polish letters correctly except three of them. Out of ąćęłńóśżź, only ą, ś and ź aren't encoded properly. Weird, but it means this can be patched up pretty easily in code by replacing them (patching a different, "Â"-based wrong encoding would be a headache, but this one is easy). So it seems that it is completely Firefox's fault and an encoding like that is nonexistent (or is it?). Anyway, thanks for your help. I'm drifting off-topic with this now, so I'll try to find out what's going on on the Firefox side tomorrow. – dyer Oct 15 '15 at 02:32
  • In Firefox you can open the developer tools, go to Console, enable Net, and look for the charset in the response headers. It could be encoded in windows-1257, for example. Do not try to open these files in Notepad; use an editor which lets you choose what encoding to use, e.g. Notepad++. – roeland Oct 15 '15 at 02:40
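
For the "Obowi±zki wdziêczno¶ci" string discussed in the comments, here is a hedged Python 3 sketch of a repair. It rests on an assumption the thread never confirms: that the saved page was actually ISO-8859-2 and got decoded with the wrong codec. The byte values fit that guess (0xB1, 0xEA and 0xB6 are ą, ę and ś in ISO-8859-2), and ą, ś and ź are exactly the Polish letters on which ISO-8859-2 and windows-1250 disagree. `repair_mojibake` is a hypothetical helper name, not a library function.

```python
def repair_mojibake(text, wrong, right="iso-8859-2"):
    """Undo a wrong decode: recover the original bytes with the codec that
    was wrongly applied, then decode them with the intended codec."""
    return text.encode(wrong).decode(right)

# All three letters broken: the bytes were read as Latin-1.
fixed_latin1 = repair_mojibake("Obowi±zki wdziêczno¶ci", wrong="latin-1")

# Only ą and ś broken (ę survives): the bytes were read as windows-1250.
fixed_cp1250 = repair_mojibake("Obowi±zki wdzięczno¶ci", wrong="cp1250")

print(fixed_latin1)
print(fixed_cp1250)
```

If the assumption holds, the cleaner fix is to avoid the damage entirely by passing the real encoding when opening the file, e.g. `open(path, encoding='iso-8859-2')`, rather than repairing the text afterwards.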