11

I have a binary string like this: 1101100110000110110110011000001011011000101001111101100010101000

I want to decode it as UTF-8 text. How can I do this in Python?

Aidin.T
  • What encoding is the binary string in? ASCII? Or you mean the bytes are a utf-8-encoded string and you want to get a unicode string in python? – Claudiu Oct 08 '13 at 18:52
  • What do you mean by "convert it to utf-8"? Create the characters from the binary octets? – Paulo Bu Oct 08 '13 at 18:53
  • The binary string is in utf-8, and yes, I want to get a unicode string in Python. – Aidin.T Oct 08 '13 at 18:55
  • I think we're not understanding precisely what sort of file you have. Could you run `hd` or `od` or a similar hex-dump utility and copy-paste the first few lines? – Robᵩ Oct 08 '13 at 18:57
  • It's not a file. I have some text in Persian that I converted to binary, and now I want to convert it back to text. – Aidin.T Oct 08 '13 at 19:04
  • Tell us more. For example, how did you convert it to binary? – Robᵩ Oct 08 '13 at 19:05
  • This: https://sites.google.com/site/nathanlexwww/tools/utf8-convert – Aidin.T Oct 08 '13 at 19:07

4 Answers

18

Cleaner version:

>>> test_string = '1101100110000110110110011000001011011000101001111101100010101000'
>>> print ('%x' % int(test_string, 2)).decode('hex').decode('utf-8')
نقاب

Inverse (from @Robᵩ's comment):

>>> '{:b}'.format(int(u'نقاب'.encode('utf-8').encode('hex'), 16))
'1101100110000110110110011000001011011000101001111101100010101000'
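
The snippets above rely on Python 2's `'hex'` codec, which Python 3 no longer offers on `str`. Below is a minimal Python 3 sketch of the same integer round trip, assuming the bit string's length is a multiple of 8. The function names `bits_to_text` and `text_to_bits` are illustrative and not part of the answer.

```python
def bits_to_text(bits, encoding="utf-8"):
    """Decode a string of '0'/'1' characters into text."""
    # Assumption: len(bits) is a multiple of 8, so it maps to whole bytes.
    n_bytes = len(bits) // 8
    raw = int(bits, 2).to_bytes(n_bytes, "big")
    return raw.decode(encoding)


def text_to_bits(text, encoding="utf-8"):
    """Encode text and render the bytes as one string of bits."""
    raw = text.encode(encoding)
    # Pad to 8 bits per byte so leading zero bits are not dropped.
    return format(int.from_bytes(raw, "big"), "0{}b".format(len(raw) * 8))


bits = "1101100110000110110110011000001011011000101001111101100010101000"
print(bits_to_text(bits))            # نقاب
print(text_to_bits("نقاب") == bits)  # True
```

Sizing the byte count from the length of the bit string, rather than from the integer's value, should also preserve leading zero bits that a plain hex round trip would drop.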
Igonato
  • But it doesn't work properly. It shows something else, not the text I originally converted to binary. – Aidin.T Oct 08 '13 at 19:22
  • worked, thank you. I think I should move the check to this answer. It's really simpler – Aidin.T Oct 08 '13 at 19:37
  • And the inverse would be: `s=u'نقاب'; print '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))` – Robᵩ Oct 08 '13 at 19:48
  • @Robᵩ added to the answer with a minor edit (I think in this case `.encode('utf-8')` is unnecessary). – Igonato Oct 08 '13 at 19:59
  • Or maybe I am wrong. Direct version worked for me without `.decode('utf-8')` as well. Any idea why that could happen? – Igonato Oct 08 '13 at 20:02
  • @Igonato - Dunno. With what you have in the answer, I get UnicodeEncodeError. – Robᵩ Oct 08 '13 at 20:16
  • I receive the text from the user and put it in a variable called `s`, but `'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))` won't work. How can I change `s` to the unicode type? – Aidin.T Oct 08 '13 at 20:42
  • UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128) – Aidin.T Oct 08 '13 at 20:45
  • Works for me, I don't know how to help you. @Robᵩ can you test the string from previous comment? – Igonato Oct 08 '13 at 20:54
  • @Aidin.T if you are testing it with a file, make sure the file is saved in utf-8 encoding and has a `# -*- coding: utf-8 -*-` line at the top. – Igonato Oct 08 '13 at 20:56
  • Did you execute this code? `s = "سلام" ; '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))` – Aidin.T Oct 08 '13 at 20:57
  • Yes, I did. Works fine for me. – Igonato Oct 08 '13 at 20:58
  • @Igonato no, it's not a file – Aidin.T Oct 08 '13 at 20:59
  • Weird indeed. I am running out of ideas. Try putting it in a file, saving it with the proper encoding, and then running it from cmd like `python my_file.py`. Don't forget `# -*- coding: utf-8 -*-` – Igonato Oct 08 '13 at 21:06
  • Note that `s = "سلام"` and `s = u"سلام"` give different results. The former fails, the latter works. But let's stop solving the new problem. @Aidin.T, if you have a problem with *encoding*, please open a new question. – Robᵩ Oct 08 '13 at 21:12
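
The difference described in the last comment comes from Python 2: calling `.encode('utf-8')` on a byte `str` first decodes it implicitly with the ASCII codec, which fails on non-ASCII bytes. Here is a rough Python 3 illustration of that implicit step and of an explicit fix. The helper name `to_text` and the assumption that the incoming bytes are UTF-8 are illustrative, not taken from the thread.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding bytes explicitly with the stated encoding."""
    # Assumption: byte input really is in `encoding`; input read from a
    # console may use a different one.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = "سلام".encode("utf-8")  # stands in for a Python 2 byte str

try:
    raw.decode("ascii")        # the implicit step Python 2 performs
except UnicodeDecodeError as exc:
    print("implicit ASCII decode fails:", exc.reason)

print(to_text(raw))            # سلام
```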
4

Well, the idea I have is:

1. Split the string into octets.
2. Convert each octet to an integer with int, then to a byte with chr.
3. Join the bytes and decode the utf-8 string into Unicode.

This code works for me, but I'm not sure what it prints because my console doesn't support utf-8 (Windows :P ).

s = '1101100110000110110110011000001011011000101001111101100010101000'
# Python 2: chr() yields one raw byte per octet, joined into a byte string
u = "".join(chr(int(s[i:i+8], 2)) for i in range(0, len(s), 8))
d = u.decode('utf-8')
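
For readers on Python 3, where `chr` returns text rather than a raw byte, the three steps listed in this answer might be sketched as follows. The function name `decode_octets` and the length check are additions for illustration, not part of the original answer.

```python
def decode_octets(bits, encoding="utf-8"):
    """Split a bit string into 8-bit groups and decode the resulting bytes."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    # Steps 1 and 2: slice out each octet and turn it into an integer 0..255.
    octets = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
    # Step 3: pack the integers into a bytes object, then decode once.
    return bytes(octets).decode(encoding)


s = "1101100110000110110110011000001011011000101001111101100010101000"
print(decode_octets(s))  # نقاب
```

Decoding only after all bytes are joined matters because a single utf-8 character can span several bytes; here each Persian letter occupies two.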

Hope this helps!

Paulo Bu
  • Hmmm, I'm somewhat suspicious of `unichr`, because the OP says his binary is already utf-8. utf-8 has variable character length, so I just used `chr` to join the raw bytes in a string and decode them later into Unicode. – Paulo Bu Oct 08 '13 at 19:09
  • @JoranBeasley - I disagree, assuming Python2. In that step he is collecting bytes, not characters. Only after he has the utf-8-encoded byte string does he want to convert. – Robᵩ Oct 08 '13 at 19:09
  • @Robᵩ: That's my point. Nice answer, love the `split('........')`. I think it's basically the same idea as mine. +1 – Paulo Bu Oct 08 '13 at 19:11
  • +1 - This is the same technique as mine (so obviously I approve), plus you explained yours. The questioner should move the check to this better answer. – Robᵩ Oct 08 '13 at 19:12
3
>>> import re
>>> s='1101100110000110110110011000001011011000101001111101100010101000'
>>> print (''.join([chr(int(x,2)) for x in re.split('(........)', s) if x ])).decode('utf-8')
نقاب
>>> 

Or, the inverse:

>>> s=u'نقاب'
>>> ''.join(['{:08b}'.format(ord(x)) for x in s.encode('utf-8')])
'1101100110000110110110011000001011011000101001111101100010101000'
>>> 
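
A Python 3 sketch of the same per-byte conversion, for comparison. Iterating over `bytes` yields integers there, so `ord` is not needed. The name `to_bit_string` is hypothetical. Each byte is padded to eight bits, which matters for ASCII characters whose byte values are below 0x80.

```python
def to_bit_string(text, encoding="utf-8"):
    """Encode text and write every byte as exactly eight binary digits."""
    # Iterating over a bytes object in Python 3 yields ints in 0..255.
    return "".join(format(byte, "08b") for byte in text.encode(encoding))


print(to_bit_string("نقاب"))

# Without padding, ASCII bytes lose their leading zero bit and the
# result can no longer be split back into octets.
padded = to_bit_string("test")
unpadded = "".join(format(b, "b") for b in "test".encode("utf-8"))
print(len(padded), len(unpadded))  # 32 28
```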
Robᵩ
  • I have another question: how can I convert my text to binary with Python? I mean the inverse of my question. – Aidin.T Oct 08 '13 at 19:10
1

Use:

def bin2text(s): return "".join([chr(int(s[i:i+8],2)) for i in xrange(0,len(s),8)])


>>> print bin2text("01110100011001010111001101110100")
test
Nacib Neme