How do I decode utf-8 from a file

Question

I have a program that attempts to encrypt a message using aes. The problem arises when I have to encrypt the message and I get TypeError: Object type <class 'str'> cannot be passed to C code. I found that if I encode it to utf-8 it works, but then when I try to decrypt it it doesn't get rid of the b'...' and the base64 decryption fails, making my iv not 16 bytes. Whenever I try to decode the first line of the file using aes.decrypt(file.readline().decode("utf-8")) it says I can't use decode on a str.

from Crypto.Cipher import AES
from Crypto import Random

def pad(s):
    pad = s + (16 - len(s) % 16) * chr(16 - len(s) % 16)
    return str(pad)

def unpad(s):
    unpad = s[:-ord(s[len(s)-1:])]
    return str(unpad)


class AESCipher:
    def __init__( self, key ):
    self.key = key

    def encrypt( self, s ):
        raw = pad(s)
        iv = Random.new().read( AES.block_size )
        cipher = AES.new( self.key, AES.MODE_CBC, iv )
        return base64.b64encode( iv + cipher.encrypt( raw.encode("utf-8") ) )

    def decrypt( self, enc ):
        enc = base64.b64decode(enc)
        iv = enc[:16]
        cipher = AES.new(self.key, AES.MODE_CBC, iv )
        return unpad(cipher.decrypt( enc[16:] ))

I'm new to encryption so I don't really know if this has been answered before and I just don't know how to word it, but I've been looking around for a few hours and haven't found anything. Thank you. Again, sorry if this isn't worded properly.

You should always use bytes when encrypting/decrypting data. Then you can decode to string (if plaintext), or base-46 encode (if ciphertext). The `b''` signifies a byte string ant it is not part of the actual string, it's just the string's representation. You can 'get rid of it' if you decode to string (`b''.decode()`). Remember, `'s'.encode()` -> bytes, `b's'.decode()` -> string. — t.m.adam, Dec 09 '18 at 10:22
Where are `pad` and `unpad` defined? Always include the definitions of non-standard (or non-obvious) functions in your sample code. — Tomalak, Dec 09 '18 at 11:47
`def pad(s): pad = s + (16 - len(s) % 16) * chr(16 - len(s) % 16) return str(pad) def unpad(s): unpad = s[:-ord(s[len(s)-1:])] return str(unpad)` — Bon Jon, Dec 09 '18 at 16:45
@t.m.adam The problem was when I read from the file it acted like a literal string of "b's'" and wouldn't let me decode it. — Bon Jon, Dec 09 '18 at 16:46
(You can always edit your posts - addidtions and new code should not go in the comments.) — Tomalak, Dec 09 '18 at 16:48
Maybe you used `str()` to convert to string (eg `str(b'data')`)? Use `.decode()` to convert bytes to string (eg `b'data'.decode()`). It's best to design your `.encrypt()` and `.decrypt()` methods to expect and return bytes, as Tomalak suggests. Finally, PyCryptodome provides padding functions in `Crypto.Util.Padding`. — t.m.adam, Dec 09 '18 at 17:05

Tomalak · Accepted Answer · 2018-12-09T13:13:19.350

4

Your encrypt and decrypt operations are not mirror images of each other.

def encrypt( self, s ):
    iv = Random.new().read( AES.block_size )       # new IV
    cipher = AES.new( self.key, AES.MODE_CBC, iv ) # create cipher
    payload = s.encode("utf-8")                    # string to bytes
    encrypted = cipher.encrypt(pad(payload))       # pad before encrypt
    return base64.b64encode( iv + encrypted )      # b64 data

def decrypt( self, enc ):
    data = base64.b64decode( enc )                 # b64 data
    iv = data[:AES.block_size]                     # split it up
    encrypted = data[AES.block_size:]              # 
    cipher = AES.new(self.key, AES.MODE_CBC, iv )  # recreate cipher
    payload = unpad(cipher.decrypt( encrypted ))   # unpad after decrypt
    return payload.decode("utf8")                  # bytes to string

Only bytes can be encrypted. Strings are not bytes, so encoding strings into a byte representation first is necessary. UTF-8 is a suitable representation, but it could be UTF-16 or even UTF-32 (read about the differences).

However, since the cipher can handle any byte payload, I would remove the part that currently limits these functions to strings. I'd change them to expect and return bytes, and then either:

call them as x = aes.encrypt(s.encode('utf8')) and s = aes.decrypt(x).decode('utf8'), respectively, or
make wrapper functions for string handling.

For encrypting files you can then directly do this:

with open('some.txt', 'rb') as fp:
    encrypted = aes.encrypt(fp.read())

and this would not impose any encoding assumptions at all, but encrypt the bytes of the file as they are.

AES is a block cipher, which means encrypt(a) + encrypt(b) is the same as encrypt(a + b). For encrypting files that's very useful, because you can read the file incrementally in chunks of N * AES.block_size, with only the last chunk padded. This is a a lot more memory-efficient than reading the whole file into memory first. Your current setup of encrypt and decrypt does not make use of that.

edited Dec 09 '18 at 13:13

answered Dec 09 '18 at 11:45

Tomalak

332,285
67
532
628

I tried this but now I'm getting an error during decrypt that the data needs to be padded to 16 byte chunks: `ValueError: Data must be padded to 16 byte boundary in CBC mode` – Bon Jon Dec 09 '18 at 17:16
It is padded. The "unpad" call happens after the decrypt call, and the input to `decrypt` is the padded bytes from the base-64 encoded data. – Tomalak Dec 09 '18 at 17:20
Note that your `pad` and `unpad` functions must handle `bytes`, not `str`. You need to modify them accordingly. – Tomalak Dec 09 '18 at 17:24
Sorry for the dumb question, but how do I make it handle bytes instead of str? – Bon Jon Dec 09 '18 at 18:27
Your function does `input + (N * chr(something))`, in other words it does `input + str`. This can only succeed if `input` is a string itself. You need to do `input + bytes`, so `input` can be `bytes`. And that means the part `N * chr(something)` must change to something that produces `bytes`. – Tomalak Dec 09 '18 at 18:49
I've modified it by encoding the chr(x) to utf8, but I'm still getting the error. I'm unsure if I did it wrong, or if there's something else. – Bon Jon Dec 09 '18 at 19:20
If `pad(b'Hello')` results in `b'Hello\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b\x0b'` then it's correct. Other than that I can only tell you scrutinize your code line-by-line. I've tested my decrypt function, it works. Unless you're doing something wrong the error "Data must be padded to 16 byte boundary in CBC mode" cannot happen. – Tomalak Dec 09 '18 at 19:30
It does leave me with b'Hello\x0b...' like you say. I found that my b64decode is outputting something other than what I input to b64encode. Do you know of a way to fix this? – Bon Jon Dec 09 '18 at 20:54
No, because that's impossible. :D `b64encode()` and `b64decode()` inverse operations, they can't produce mis-matched output. `input == b64decode(b64encode(input))`. But you are on the right track, somewhere there must be a difference. – Tomalak Dec 09 '18 at 20:58
Thank you for all the help. It was because I was casting the encoded data into a str when I wrote it to the file it wouldn't read it as b'...' it read it as a str of "b'...'". Thanks again for the help – Bon Jon Dec 09 '18 at 21:15
No problem at all, you're welcome. Glad you found it in the end! – Tomalak Dec 09 '18 at 21:16
@BonJon OBTW, when you are writing encrypted data to file, **don't** base-64 encode it. Write the bytes to the file directly, skip the base64. – Tomalak Dec 09 '18 at 21:19
Ok I will, can you elaborate on why? – Bon Jon Dec 09 '18 at 21:28
A) base64 encoding makes the data about 1/3 longer, and b) Files are designed to hold bytes. You have bytes. Store them in the file. c) Encoding and decoding base64 costs additional CPU cycles and adds zero value here. base64 is a workaround for storing binary data in an environment that is designed for strings (for example a VARCHAR database column, or in an HTTP header, or an XML element, ...). If you don't have that restriction, you don't need to use that workaround. – Tomalak Dec 09 '18 at 21:34
Ah ok thank you for the information, I'll keep that in mind. – Bon Jon Dec 09 '18 at 21:38

How do I decode utf-8 from a file

1 Answers1