
I am trying to encode and decode the Hebrew string "שלום". However, after encoding, I get gibberish:

>>> word = "שלום"
>>> word = word.decode('UTF-8')
>>> word
u'\u05e9\u05dc\u05d5\u05dd'
>>> print word
שלום
>>> word = word.encode('UTF-8')
>>> word
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> print word
׳©׳׳•׳

How should I do it properly?

Tomerikoo
user1767774
  • b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d' are the bytes that make up the UTF-8 string. When you print them as a string, it looks like gibberish in Python 2 (assuming the standard default encoding), but it would look as in my comment in Python 3. If you then decode those bytes back using UTF-8, you will end up with the Unicode string you started from. – paddyg Apr 24 '15 at 15:14
  • What's the result of `sys.getdefaultencoding()` in your terminal? – Mazdak Apr 24 '15 at 15:20
  • I get the string 'ascii'. – user1767774 Apr 24 '15 at 15:26
  • Can you add the python version you are using, please! – go2 Apr 24 '15 at 15:48
  • It's Python 2.7.3 and I'm using Pyscripter. – user1767774 Apr 24 '15 at 15:51
  • On 2.7.6, it works fine! Your code looks correct and there should be no major differences in that between the two. Have you tried running that directly through the Python interpreter? – go2 Apr 24 '15 at 16:03
  • `>>> word = "שלום" >>> word '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d' >>> print word שלום >>> word = word.decode('UTF-8') >>> word u'\u05e9\u05dc\u05d5\u05dd' >>> print word שלום >>> word = word.encode('UTF-8') >>> word '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d' >>> print word שלום >>>` – jonhurlock Apr 24 '15 at 16:28
  • What if I want to write both English and Hebrew to the same file? Which encoding do I use? – Daniel Apr 06 '20 at 09:16

1 Answer


You have to make sure your environment (shell or script) is using the right encoding. If you're running a script, put the following at the top of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

This declares the source file as UTF-8, so the interpreter reads the Hebrew literal correctly. Your terminal may accept only ASCII, so make sure it supports UTF-8 as well.
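If you are not sure what your console is set to, a quick check along these lines may help. This is only a rough sketch: `describe_encodings` is a made-up helper name, not part of any library, and the values it reports depend entirely on your shell or IDE.

```python
# -*- coding: utf-8 -*-
import locale
import sys


def describe_encodings():
    """Collect the encodings Python associates with this environment.

    Hypothetical helper for illustration only.
    """
    return {
        # Used by Python 2 for implicit str <-> unicode conversions.
        "default": sys.getdefaultencoding(),
        # What print uses for text; may be None when output is redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        # What the operating system locale prefers.
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(describe_encodings().items()):
    print("%-8s %s" % (name, value))
```

If `stdout` reports something other than UTF-8 (or nothing at all), the console is a likely suspect for the gibberish. In a console that does handle UTF-8, the session looks like this: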

>>> word = "שלום"
>>> word
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> print word
שלום
>>> word = word.decode('UTF-8')
>>> word
u'\u05e9\u05dc\u05d5\u05dd'
>>> print word
שלום
>>> word = word.encode('UTF-8')
>>> word
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> print word
שלום
>>>
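
The same round trip can be written as a small script that keeps text and bytes in separate variables. This is a minimal sketch, assuming the file itself is saved as UTF-8; the variable names are my own.

```python
# -*- coding: utf-8 -*-

# Text: a unicode object on Python 2, a str on Python 3.
word = u"שלום"

# Text -> bytes. These are the raw UTF-8 bytes, two per Hebrew letter.
encoded = word.encode("utf-8")

# Bytes -> text. Decoding with the same codec gives back the original.
decoded = encoded.decode("utf-8")

assert len(word) == 4
assert len(encoded) == 8
assert decoded == word

# repr() is safe to print on any console because it only uses ASCII escapes.
print(repr(encoded))
```

When you want to show the word itself, it is usually safer to print the decoded text rather than the encoded bytes, since that leaves the conversion to the console's encoding up to Python instead of sending raw UTF-8 bytes to a console that may expect something else.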
Veltzer Doron
jonhurlock