How can I effectively store binary data in a file that's in a text format like CSV?

Question

I'm currently working on a password storage program in Python, though C would likely be faster. I've been trying for the past hour or so to find a way to store a bytes object in a CSV file. I'm hashing the passwords with their own salt, and then storing that, and grabbing it again to check the password. It works perfectly well when it's stored in memory.

import hashlib
import os

password = "top_secret"  # placeholder value for illustration

salt = os.urandom(64)
hash = hashlib.pbkdf2_hmac(
    'sha256',
    password.encode('utf-8'),
    salt,
    1000000
)
storage = salt + hash
salt_from_store = storage[:64]
hash_from_store = storage[64:]

However, when I try storing it in a CSV file, so it doesn't have to be constantly running, I get an error,

TypeError: write() argument must be str, not bytes

So, I converted it to a string using,

str(storage)

and that wrote just fine. But then, when I get it from the file, it's still a string, and the length goes from 128 (bytes) to 300+ (chars). It's also never consistent. I don't know the encoding, so I can't change it like that. When I print the bytes, it's a bunch of characters with backslashes and X's

b'\xfd\x3a'

and occasionally some random special characters. I'm not sure if there's a way to convert that to an int, and let it be converted back. Another issue is that I've found a way to do it, by changing

b"\xf1\x96"

to

"b\xf1\x96"

which prints the encoded text, rather than the bytes it's made up of. However, I don't know if that's a good way of changing it, and if it is, if there's a way to do it without something like

bytes[0] = '"'
bytes[1] = 'b'

To write bytes, either write to something that expects to contain bytes, or write text that represents the bytes in some way. CSV is fundamentally a text-based format. — Karl Knechtel, Aug 24 '21 at 22:59
Okay? Are you going to elaborate on that? Maybe give an example? Or will you just say, "You're doing it wrong, but I don't care enough to help you fix it,"? — Xarvveron, Aug 24 '21 at 23:09
I wrote an answer. These things take time. That said, you *are* [expected to do some research yourself](https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users). For example, it is helpful to put things like `python convert bytes to str` [into a search engine](https://duckduckgo.com/?q=python+convert+bytes+to+str). — Karl Knechtel, Aug 24 '21 at 23:13
Python will be plenty fast enough for anything that is I/O bound. — chepner, Aug 24 '21 at 23:36

score 1 · Answer 1 · answered Aug 24 '21 at 23:12

To write bytes, either write to something that expects to contain bytes, or write text that represents the bytes in some way. CSV is fundamentally a text-based format. If you're going to use a CSV file, then you're going to open it in text mode, and write text to it.

Fundamentally, every file on the hard drive consists of bytes. This implies that, when you open the CSV file, you will be choosing (or using a default) text encoding scheme. So your bytes object will have to be converted twice on writing (to text, and then into the underlying bytes in the file, which you could verify for example with a hex editor), and twice again on reading. That's just the reality of dealing with mixed data. Thankfully, half that work is taken care of for you automatically (by the open call, and by tools layered on top of it like csv.reader).
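To make the two conversion steps concrete, here is a minimal sketch of a round trip through a CSV file. It assumes base64 as the bytes-to-text step (hex would work equally well); the file name `passwords.csv`, the user name `alice`, and the random 96-byte stand-in for the salt plus hash are all made up for illustration.

```python
import base64
import csv
import os
import tempfile

# Stand-in for salt + hash: 64-byte salt plus a 32-byte SHA-256 digest.
storage = os.urandom(96)

# Step 1 (yours): bytes -> text. Base64 output is plain ASCII.
text = base64.b64encode(storage).decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "passwords.csv")  # hypothetical file name

    # Step 2 (automatic): open() encodes the text into the file's bytes.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(["alice", text])

    # Reading reverses both steps: open() decodes the file into text...
    with open(path, newline="", encoding="utf-8") as f:
        username, stored_text = next(csv.reader(f))

# ...and you turn the text field back into bytes yourself.
restored = base64.b64decode(stored_text)
print(restored == storage)  # True
```

Passing `newline=""` to `open` is what the `csv` module documentation recommends, so that the module can handle line endings itself.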

So, I converted it to a string using str(storage)

This is not actually a conversion in the sense that you're most likely interested in. It asks for a printable, human-readable representation of the object. (There is also repr, which asks for a more technically oriented representation; for str and bytes objects, that's where the enclosing quotation marks come from, among other adjustments. When you print something, its str is used. When you evaluate something at the REPL, you see the repr of the result, except that when the result is None, nothing is shown at all. A bytes object has no separate str form, so str falls back to the repr, which is why the b prefix and the quotes end up in your string.) Specifically for dealing with bytes and str objects, Python has a concept of encoding and decoding, which uses explicit .encode (str->bytes) and .decode (bytes->str) methods. These are topics you can easily look up in the documentation (or previous Stack Overflow questions, or on the Internet in general).
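A small sketch of the difference, using the two bytes from the question. It also suggests why the length was "never consistent": in the repr, a byte that happens to be printable ASCII takes one character, while any other byte takes four (`\xNN`), so the total depends on the random data.

```python
data = b"\xfd\x3a"

# str() on bytes gives the repr text, including the b prefix and quotes.
as_str = str(data)
print(as_str)                  # b'\xfd:'
print(as_str == repr(data))    # True
print(len(data), len(as_str))  # 2 8

# .encode/.decode are for bytes that actually represent text.
text = "héllo"
encoded = text.encode("utf-8")
print(encoded)                         # b'h\xc3\xa9llo'
print(encoded.decode("utf-8") == text) # True
```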

when I print the bytes, it's a bunch of characters with backslashes and X's

Yes, this is the form that Python uses to tell you what data exists inside the bytes object. What you're saying here is basically the same as "when I print the list, it's a bunch of list elements with commas surrounded by square brackets", or "when I print the integer, it's a bunch of digit symbols".

But then, when I get it from the file, it's still a string, and the length goes from 128 (bytes) to 300+ (chars).

So decode it again. Of course you do need to encode properly. Everything that you get from the file will be a string, because you are opening the file in text mode, because CSV is a text format. (Incidentally, you are using the csv standard library module for this, right?)

It's also never consistent. I don't know the encoding

So tell it which encoding to use; and if you need to use a consistent amount of text, choose an encoding that consistently maps one byte to one Unicode code point (such as latin-1, also named iso-8859-1). But I suspect you don't actually care how long the text is (if anything, you'd care about the number of bytes used in the file).
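As a rough illustration of the one-byte-to-one-code-point property, here is a sketch using random bytes as a stand-in for the salt plus hash. Note that the resulting text can contain control characters, quotes and newlines, so it is awkward to store in a CSV even though the round trip itself is lossless.

```python
import os

raw = os.urandom(96)  # stand-in for salt + hash

# latin-1 accepts every byte value, so this decode cannot fail.
text = raw.decode("latin-1")
print(len(text))                      # 96
print(text.encode("latin-1") == raw)  # True
```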

I've found a way to do it, by changing

You can only do this with literal data. Do not think in these terms. The b is part of the language syntax. It is not data.

What I mean by "I don't know the encoding", is that it's automatically generated by hashlib.pbkdf2_hmac(), and it doesn't tell me what it is. I realize now that "converted" wasn't the right word, but I also did try both str.decode() and str(byte, 'utf-8'), which both didn't help. What are some examples of file formats similar to csv and JSON that are good for byte-data? — Xarvveron, Aug 24 '21 at 23:17
You can also choose to "encode" in other ways, such as by converting each byte to a hexadecimal string representation, or to [base64](https://en.wikipedia.org/wiki/Base64), or - if you really want to waste space - to `0` and `1` symbols (binary string representation). The `.encode` and `.decode` methods, and the encoding names you can use with them, are based around the assumption that the bytes *represent* text in some way. — Karl Knechtel, Aug 24 '21 at 23:18
Sorry, I got that a bit wrong initially. `pbkdf2_hmac` specifically gives you a raw `bytes` object. You "don't know the encoding" because it isn't encoded and *doesn't represent text*; it's just raw data. To create a string, you need to decode it anyway, using an encoding that is able to handle any possible input byte sequence. The simplest of these is `latin-1`, which maps each individual byte to the first 256 Unicode code points in order. Or, yes, you can use a non-text-interpretation scheme. That might be a better idea for readability, and so you don't need to worry about CSV's escaping scheme. – Karl Knechtel, Aug 24 '21 at 23:27
Ahh, okay. I was confused at first. Thanks for coming back to explain why I didn't know the encoding. — Xarvveron, Aug 24 '21 at 23:32

score 1 · Accepted Answer · answered Aug 24 '21 at 23:17

If you want to save bytes as a string, you should probably encode them in a format made for this, like base64. This is more space-efficient than writing hex directly: base64 uses 4 characters for every 3 bytes, while hex uses 2 characters for every byte.

Trying to decode arbitrary bytes directly with an encoding like utf-8 will likely raise a UnicodeDecodeError.
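For instance, the two bytes shown in the question already fail, because 0xfd can never start a valid UTF-8 sequence. A minimal sketch:

```python
data = b"\xfd\x3a"
error = None

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # Raised because the data is not valid UTF-8 text.
    error = exc

print(type(error).__name__)  # UnicodeDecodeError
```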

In your case, you could do something like:

import os, hashlib, base64

password = "top_secret"

salt = os.urandom(64)
hash = hashlib.pbkdf2_hmac(
    'sha256',
    password.encode('utf-8'),
    salt,
    1000000
)
storage = salt + hash

# convert to a base64 string:
s = base64.b64encode(storage).decode('utf-8')

print(s)  # <-- a string you can save to a file

# after reading it back from a file convert back to bytes
the_bytes = base64.b64decode(s)

the_bytes == storage 
# True
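Putting this together with the password check from the question, a sketch might look like the following. The helper names `make_record` and `check_password` are made up for illustration, `hmac.compare_digest` is swapped in for a plain `==` to get a constant-time comparison, and the iteration count is lowered from the 1000000 above only so the example runs quickly.

```python
import base64
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumption: lowered from 1000000 to keep the demo fast
SALT_LEN = 64

def make_record(password: str) -> str:
    """Return a base64 string holding salt + hash, ready to write as text."""
    salt = os.urandom(SALT_LEN)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return base64.b64encode(salt + digest).decode("ascii")

def check_password(password: str, record: str) -> bool:
    """Recompute the hash with the stored salt and compare the two digests."""
    storage = base64.b64decode(record)
    salt, expected = storage[:SALT_LEN], storage[SALT_LEN:]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)

record = make_record("top_secret")
print(check_password("top_secret", record))  # True
print(check_password("wrong", record))       # False
```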

answered Aug 24 '21 at 23:17

Mark


Thank you! Are there any good reasons to use this, rather than the built-in hex function? I assume it makes for shorter strings, as it's a higher number base. But, is there anything else? Nonetheless, I'm very grateful for your helpful answer! – Xarvveron Aug 24 '21 at 23:26
@Xarvveron as far as I know the main issue is that it's more space efficient, which is why it's common for an interchange format. – Mark Aug 24 '21 at 23:28
Hex, by its construction, represents 4 bits per character (because it only uses 16 distinct text characters). It will take you two characters (which will be encoded in two bytes in any normal encoding) to represent each byte of the original data. (Strings are composed of "Unicode code points", which are not the same thing as "characters"; but the difference is irrelevant in this context.) Base64, as the name suggests, uses 64 distinct text characters, and thus encodes 6 bits per character (while still using plain ASCII characters that get encoded as a single byte in every normal encoding). – Karl Knechtel Aug 24 '21 at 23:32
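The 4-bits versus 6-bits arithmetic can be checked with a short sketch, assuming 96 bytes of data (a 64-byte salt plus a 32-byte SHA-256 digest, which is what the code in the question would produce):

```python
import base64
import os

data = os.urandom(96)

hex_text = data.hex()                              # 2 characters per byte
b64_text = base64.b64encode(data).decode("ascii")  # 4 characters per 3 bytes

print(len(hex_text))  # 192
print(len(b64_text))  # 128
```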
Okay, thanks. I'll probably use it, since it's efficient, and in the standard libraries. – Xarvveron Aug 24 '21 at 23:32
Thanks for that @KarlKnechtel…well said. – Mark Aug 24 '21 at 23:34
There are other options, but base64 is very popular - and very old. It was used back in the day to send binary files in places that expected all the data to be text - like via the body of an email, or on USENET. It also avoids problems with newlines, "extended ascii" not working, etc. – Karl Knechtel Aug 24 '21 at 23:34
Assuming the underlying file encoding is UTF-8, using Base64 will actually be more space efficient on average than pretending the data is latin-1 text. The reason is that UTF-8 will require two bytes for code points in the 128-255 range. On average you will use 36 bits to represent 24 bits of original data this way, vs. 32 bits using Base64. It's actually slightly worse than that, because newlines, quotation marks etc. in the resulting string need to be escaped (the `csv` module handles this automatically). – Karl Knechtel Aug 24 '21 at 23:41
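The 36-versus-32-bit figure can be reproduced with a sketch that uses every byte value equally often, which approximates the "on average" case for random data. CSV escaping overhead is not counted here.

```python
import base64

# Every byte value 0..255, three times over: 768 bytes in total.
data = bytes(range(256)) * 3

# Pretend the bytes are latin-1 text, then store that text as UTF-8.
latin1_as_utf8 = data.decode("latin-1").encode("utf-8")

# Base64 output is ASCII, so it takes one byte per character in UTF-8.
b64 = base64.b64encode(data)

print(len(latin1_as_utf8))  # 1152, i.e. 1.5 bytes per input byte (36 bits per 24)
print(len(b64))             # 1024, i.e. 4 bytes per 3 input bytes (32 bits per 24)
```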

score 0 · Answer 3 · answered Aug 24 '21 at 23:15

You could use hex. Let's get some data:

>>> import os
>>> b = os.urandom(10)
>>> b
b'\xc5\xe2{\xdf\xd2\x13\xa7\x0b\xef\x07'

As a hex string that you can write to CSV:

>>> b.hex()
'c5e27bdfd213a70bef07'

Back to bytes:

>>> bytes.fromhex(b.hex())
b'\xc5\xe2{\xdf\xd2\x13\xa7\x0b\xef\x07'

answered Aug 24 '21 at 23:15

Kelly Bundy


Yes, thank you!! That was extremely helpful, and now it works! If I had enough points to vote for your answer, I would. But, I did mark it as the approved answer, and I'm very grateful. I should probably spend more time reading documentation, before I come to ask for help. Thank you for spending your time to give such a great and concise answer! – Xarvveron Aug 24 '21 at 23:22
