
I'm trying to get a Python program to read a word that another Python program encoded as UTF-8 and saved to a txt file.

For example, the string it gets might be:

b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'

and it arrives as a normal string, as if I had written:

word_string = "b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'"

How do I make the script treat this as a bytes string and not a normal string? I know it can be written directly as

word_bytes = b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'

but if the content of the variable 'word_bytes' is already written in a file, how can I read it back and make the program understand that it only has to decode it? When I try to decode it, Python says it's a string and can't be decoded. Any help?

Thanks in advance!

UPDATE: For anyone who reads the string from a file (I'm on Windows 7): building on tripleee's answer, the text read from the file contains literal backslashes, which Python shows doubled (as \\xd1 and so on). Encoding it and then decoding with 'unicode_escape' turns each \xNN sequence into a single character, which gives the string used in tripleee's answer. So the way to read it from a file and decode it is the following:

s = r'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'.encode().decode('unicode_escape') [the r'' literal stands for the text read from the file with open(file,"r"), in my case]
s.encode('latin-1').decode('utf-8') [or ISO-8859-1, which is another name for the same encoding]
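
The two steps above can be wrapped in a small helper. This is only a sketch: the function names `decode_escaped_utf8` and `decode_bytes_literal` are made up for illustration, and it assumes the file holds ASCII text with `\xNN` escapes that spell UTF-8 bytes. The second helper covers the case from the question where the file holds the whole literal, including the `b'...'` wrapper, and uses the standard library's `ast.literal_eval` to parse it.

```python
import ast


def decode_escaped_utf8(text):
    """Decode text like r'\xd1\x84' (literal backslashes) as UTF-8."""
    # Step 1: interpret the \xNN escapes; each one becomes a single
    # character in the range U+0000..U+00FF.
    unescaped = text.encode("ascii").decode("unicode_escape")
    # Step 2: Latin-1 maps those characters back to the same byte values.
    raw = unescaped.encode("latin-1")
    # Step 3: the bytes are assumed to be UTF-8.
    return raw.decode("utf-8")


def decode_bytes_literal(text):
    """Decode text that is a complete bytes literal, e.g. "b'\\xd1\\x84'"."""
    value = ast.literal_eval(text)  # parses literals only, runs no code
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal")
    return value.decode("utf-8")


# Stand-in for the text read from the file (assumption: no trailing newline;
# call .strip() on real file content first).
file_text = r"\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc"

print(decode_escaped_utf8(file_text))
print(decode_bytes_literal("b'" + file_text + "'"))
```

Both calls print the same word. In either case UTF-8 still has to be named (or assumed) for the final decode.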

EDIT: tripleee's answer is almost what I wanted to know (about half is still missing), but it is already a way to do it, so thank you! But how could I do it without knowing the encoding (in this case I didn't know the encoding was latin-1, and I can't try every encoding)? I mean something like putting a b before the bytes string, as in the 'word_bytes' variable (maybe that picks the right encoding automatically?). I wanted to do the same thing, but with a function applied to a variable that already holds the bytes part.

Edw590

2 Answers


If you have the bytes in a variable already, you are all set. If you have the bytes in a string, I'm assuming you basically have a sequence of characters where the code point value of each is equivalent to the byte value it's supposed to hold. This happens to be the definition of the Latin-1 encoding - it feels a bit dirty, but the trick is to encode your string as Latin-1, then decode back as UTF-8.

>>> s = '\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
>>> s.encode('latin-1').decode('utf-8')
'форум'
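
The Latin-1 property this relies on can be checked directly. The snippet below is a small sketch, not part of the original answer: it confirms that decoding every byte value as Latin-1 gives the code point with the same number, so the encode step recovers exactly the bytes the string was standing in for.

```python
# Every byte value 0..255, decoded as Latin-1, gives the code point
# with the same number.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in as_text] == list(range(256))

# The round trip back to bytes is therefore lossless.
assert as_text.encode("latin-1") == all_bytes

# Applied to the string from the example above:
s = "\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc"
raw = s.encode("latin-1")
print(raw)                   # b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
print(raw.decode("utf-8"))
```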
tripleee
  • So I would have to always know the encoding to make it work? Because in this example, I didn't know the encoding and just put the b before the string to make the program realize it's a bytes string and it probably would do it automatically, right? Is there no way to do it without knowing the encoding? – Edw590 May 08 '19 at 18:27
  • You can attempt to decode to various different encodings and see what you get; but no, there is no way in the general case to know if a sequence of bytes corresponds to intelligible human language in some encoding without a human referee or some serious NLP. – tripleee May 08 '19 at 18:51
  • If you have a good idea of what text the bytes might represent, you can try to look them up e.g. at https://tripleee.github.io/8bit/ – tripleee May 08 '19 at 18:53
  • Oh right, I didn't know. But then how does the b do it automatically? I didn't need to choose the encoding, and if I run `print(b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'.decode())` it works perfectly and prints the right word. Is it done in another way? Btw, thanks for that page! – Edw590 May 08 '19 at 18:56
  • The implicit encoding in Python 3 is UTF-8 most places, but it only works perfectly in those places. There is no additional logic in `.decode()`, you just happened to be lucky. (When we have https://utf8everywhere.org/ we can be lucky everywhere.) In other words, Python supplies a (technically site-specific) default if you don't put in your encoding explicitly; but remember, *explicit is better than implicit.* – tripleee May 09 '19 at 03:06
  • If you are asking "how can I know Latin-1 would encode a byte to exactly the same character value" that's the definition of Latin-1, and no other 8-bit encoding has this property. If you don't know that, it's arguably obscure knowledge which is hard to find. Examining the Python standard library would reveal that this is a technique they use when they want to accomplish this task. – tripleee May 09 '19 at 03:13
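
The "attempt to decode to various different encodings" idea from the comments can be sketched as follows. The helper name `try_decodings` and the list of candidate encodings are assumptions chosen for illustration; as noted above, the code cannot tell which result is the intended text, so a person still has to judge the output. In Python 3, `bytes.decode()` called without arguments uses `'utf-8'`, which is why the bare `.decode()` in the comments happened to work.

```python
def try_decodings(raw, candidates=("utf-8", "cp1251", "koi8_r", "latin-1")):
    """Return {encoding name: decoded text, or None if decoding failed}."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding.
            results[name] = None
    return results


raw = b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'

# Calling decode() with no argument is the same as naming UTF-8 explicitly.
assert raw.decode() == raw.decode("utf-8")

for name, text in try_decodings(raw).items():
    print(name, repr(text))
```

Several candidates decode without an error here, but only one of them yields a readable word, which is the point made in the comments.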

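You can identify whether a value is a bytes string or a normal string using the following (written for Python 3, where the Python 2 name unicode no longer exists):

def identifystring(value):
    # In Python 3, bytes holds raw bytes and str holds Unicode text.
    if isinstance(value, bytes):
        print("bytes string")
    elif isinstance(value, str):
        print("ordinary (Unicode) string")
    else:
        print("not a string")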
abssab
  • Actually it's not that what I wanted to know, as I already know the string is in bytes, so the only thing I need to do is to be able to decode it after putting it in bytes form. – Edw590 May 08 '19 at 19:05