
I am trying to convert Arabic text to UTF-8 encoded bytes and then to a binary string, following this answer here.

First, I used the code as it is in the example:

'{:b}'.format(int(u'سلام'.encode('utf-8').encode('hex'), 16))

But I got this error:

AttributeError: 'bytes' object has no attribute 'encode'

I also removed .encode('hex'), but it still gives the same error.

Is there any way to convert UTF-8 encoded text to binary and vice versa?

SCB
Nujud
  • Possible duplicate of https://stackoverflow.com/questions/8815592/convert-bytes-to-bits-in-python? – pstatix Dec 22 '17 at 22:56
  • @pstatix and how can I do the inverse ? from bits to utf8 – Nujud Dec 22 '17 at 23:15
  • you don't encode utf8 to something, it is already encoded in a sequence of bytes, you can only decode it to text (text was originally encoded to utf-8, that was your beginning point) – progmatico Dec 22 '17 at 23:19
  • to do the inverse of encode, you decode. – progmatico Dec 22 '17 at 23:22
  • Quick question, why do you want to get this to binary in the first place. If this is just about encoding then there are significantly better ways of doing it such as the [`base64`](https://docs.python.org/3/library/base64.html) modules. – SCB Dec 22 '17 at 23:29
  • If you do want binary, would recommend looking at a library such as [bitstring](https://pypi.python.org/pypi/bitstring/3.1.3) to do it for you. – SCB Dec 22 '17 at 23:30
  • @SCB I want to use the binary for security purposes in my program. So I need the text in binary first, then use this binary string on its own, and when I receive binary again I want to convert it back to utf8 – Nujud Dec 22 '17 at 23:52
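
To illustrate the point made in the comments: in Python 3, str.encode() returns a bytes object, which has decode() but no encode(), hence the AttributeError. The Python 2 idiom .encode('hex') can be replaced with bytes.hex(). This is only a minimal sketch, assuming Python 3.5 or later; the variable names are made up for illustration.

```python
text = "سلام"
data = text.encode("utf-8")            # str -> bytes
assert data.decode("utf-8") == text    # decode() is the inverse of encode()

# Python 3 replacement for the Python 2 idiom .encode('hex')
hex_digits = data.hex()                # 'd8b3d984d8a7d985'

# Caveat: going through int() drops leading zero bits, so this only
# round-trips cleanly when the first byte is 0x80 or higher.
bits = "{:b}".format(int(hex_digits, 16))
print(bits)
```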

1 Answer


How about this:

>>> ''.join('{:08b}'.format(b) for b in 'سلام'.encode('utf8'))
'1101100010110011110110011000010011011000101001111101100110000101'

Iterating over the encoded bytes object yields one integer in the range 0..255 per byte. Each integer is formatted in binary notation, zero-padded to 8 digits, and str.join() glues the pieces together.
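
The same steps, spelled out as a small function, may be easier to follow. This is only a sketch; the name text_to_bits is made up for illustration and is not part of any library.

```python
def text_to_bits(text, encoding="utf-8"):
    """Return the encoded bytes of `text` as a string of 0s and 1s."""
    data = text.encode(encoding)  # str -> bytes
    # In Python 3, iterating over bytes yields ints in the range 0..255.
    chunks = [format(byte, "08b") for byte in data]  # zero-pad to 8 digits
    return "".join(chunks)


bits = text_to_bits("سلام")
print(len(bits))  # 4 letters * 2 bytes each * 8 bits = 64
```

The zero padding matters: without it, a byte such as 0x05 would come out as '101' and the byte boundaries could not be recovered later.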

For the inverse, the approach given in an answer to the question you linked to can be adapted to Python 3 as follows (s is the output of the above example, i.e. a str of 0s and 1s):

>>> import re
>>> bytes(int(b, 2) for b in re.split('(........)', s) if b).decode('utf8')
'سلام'
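
If you would rather avoid the regular expression, slicing the string in steps of 8 should work as well. Again just a sketch: bits_to_text is a hypothetical helper name, and the length check is my own addition, not something from the linked answer.

```python
def bits_to_text(bits, encoding="utf-8"):
    """Decode a string of 0s and 1s (8 per byte) back to text."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    # Take 8 characters at a time and parse each group as a base-2 integer.
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)


s = "1101100010110011110110011000010011011000101001111101100110000101"
print(bits_to_text(s))  # سلام
```

Note that decode() raises UnicodeDecodeError if the bits do not form valid UTF-8, so received data may need a try/except around this call.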
lenz