
Note: I don't know much about Encoding / Decoding, but after I ran into this problem, those words are now complete jargon to me.

Question: I'm a little confused here. I was playing around with encoding/decoding images in order to store an image in a TextField on a Django model. Looking around Stack Overflow, I found that I could decode an image from ASCII (I think, or binary? Whatever open('file', 'wb') uses as its encoding; I'm assuming the default, ASCII) to latin1 and store it in a database with no problems.

The problem comes when recreating the image from the latin1-decoded data. When attempting to write it to a file handle, I get a UnicodeEncodeError saying the ASCII encoding failed.

I think the problem is that when a file is opened as binary data (rb), its contents aren't in a proper ASCII encoding, because they are binary data. I then decode the binary data to latin1, but when converting back to ASCII (which happens automatically when trying to write to the file), it fails for some unknown reason.

My guess is one of two things. Either decoding to latin1 converts the raw binary data to something else, so that encoding back to ASCII can no longer recognize what was once raw binary data (although the original and decoded data have the same length). Or the problem lies not with the decoding to latin1 but with my attempt to ASCII-encode binary data. In that case, how would I encode the latin1 data back to an image?

I know this is very confusing, but I'm confused by all of it, so I can't explain it well. If anyone can answer this question, they're probably a riddle master.

Some code to illustrate:

>>> image_handle = open('test_image.jpg', 'rb')
>>> 
>>> raw_image_data = image_handle.read()
>>> latin_image_data = raw_image_data.decode('latin1')
>>> 
>>> 
>>> # The raw data can't be processed by django 
... # but in `latin1` it works
>>> 
>>> # Analysis of the data
>>> 
>>> type(raw_image_data), len(raw_image_data)
(<type 'str'>, 2383864)
>>> 
>>> type(latin_image_data), len(latin_image_data)
(<type 'unicode'>, 2383864)
>>> 
>>> len(raw_image_data) == len(latin_image_data)
True
>>> 
>>> 
>>> # How to write it back to a file?
>>> 
>>> copy_image_handle = open('new_test_image.jpg', 'wb')
>>> 
>>> copy_image_handle.write(raw_image_data)
>>> copy_image_handle.close()
>>> 
>>> 
>>> copy_image_handle = open('new_test_image.jpg', 'wb')
>>> 
>>> copy_image_handle.write(latin_image_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
>>> 
>>> 
>>> latin_image_data.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
>>> 
>>> 
>>> latin_image_data.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Brandon Nadeau
  • I'd start out by looking up JPG on Wikipedia, and then plain text files. An image file won't have any plain text data that can be encoded into ASCII. They are just different types of data, apples and oranges: plain text files and binary files. – cameron-f Jun 21 '15 at 00:54
  • So I can decode the image data from ASCII, but just not back to it? That would mean this is a one-way conversion? – Brandon Nadeau Jun 21 '15 at 00:57
  • Wait, when opening a file as binary, what is its encoding? I know Python defaults to ASCII, but this is raw data, correct? Shit, I can't get past this. – Brandon Nadeau Jun 21 '15 at 00:59
  • A binary file won't have a text encoding. Format might be a better term. The binary data inside a JPG has no relation to any sort of text encoding. You can try to read a binary file as a text file. Python will read the file and display Unicode characters, but it will really just be gibberish. To open a binary file you need a program that is ready to handle the file format. Microsoft Word documents are considered binaries because they add extra formatting and you need to open the files with Word specifically. Text files can be read with generic text editors like Notepad. – cameron-f Jun 21 '15 at 01:12
  • Ahh, that makes sense. No encoding is a perfect answer. – Brandon Nadeau Jun 21 '15 at 01:18
  • I'll type up an answer on binary files vs. text files and link to another Stack Overflow question about encoding. – cameron-f Jun 21 '15 at 01:22

2 Answers


Unlike normal plain text files, an image file does not have a text encoding; the data shown is just a visual representation of the image's raw bytes. As @cameron-f says above in the question comments, this is basically gibberish when read as text, and forcing it through a text encoding such as ASCII will break the image file, so don't try it.

But that doesn't mean all hope is lost. Here's a way I usually turn an image into a string and back into an image.

import zlib
from base64 import b64decode, b64encode

# Read the image as raw bytes
image_handle = open('test_image.jpg', 'rb')

raw_image_data = image_handle.read()

# Raw bytes -> base64 string -> compressed string
encoded_data = b64encode(raw_image_data)
compressed_data = zlib.compress(encoded_data, 9)

# Compressed string -> base64 string -> raw bytes
uncompressed_data = zlib.decompress(compressed_data)
decoded_data = b64decode(uncompressed_data)

new_image_handle = open('new_test_image.jpg', 'wb')

new_image_handle.write(decoded_data)
new_image_handle.close()
image_handle.close()


# Data Types && Data Size Analysis
type(raw_image_data), len(raw_image_data)
>>> (<type 'str'>, 2383864)

type(encoded_data), len(encoded_data)
>>> (<type 'str'>, 3178488)

type(compressed_data), len(compressed_data)
>>> (<type 'str'>, 2189311)

type(uncompressed_data), len(uncompressed_data)
>>> (<type 'str'>, 3178488)

type(decoded_data), len(decoded_data)
>>> (<type 'str'>, 2383864)



# Showing that the conversions were successful
decoded_data == raw_image_data
>>> True

encoded_data == uncompressed_data
>>> True
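
As a side note on the latin1 route from the question: latin1 maps each byte value 0-255 to the code point with the same number, so in principle the decoded text can be turned back into the original bytes by encoding it with latin1 again, rather than letting Python fall back to ASCII. This is a minimal sketch only. It assumes a stand-in byte string (every possible byte value) in place of a real image file, and the name restored_data is purely illustrative.

```python
# Stand-in for image bytes: one of every possible byte value (0-255).
# With a real file this would be open('test_image.jpg', 'rb').read().
raw_image_data = bytes(bytearray(range(256)))

# latin1 never fails on decode: byte N becomes code point N.
latin_image_data = raw_image_data.decode('latin1')

# Encoding with the same codec reverses the mapping exactly.
restored_data = latin_image_data.encode('latin1')

print(restored_data == raw_image_data)               # True
print(len(latin_image_data) == len(raw_image_data))  # True
```

The restored bytes could then be written to a file opened with 'wb', the same way decoded_data is written above.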
Brandon Nadeau

The UnicodeEncodeError is popping up because a JPEG is a binary file, while ASCII is an encoding for plain text in plain text files. Most of the bytes in an image fall outside the ASCII range (0-127).

Plain text files can be created with generic text editors like Notepad for Windows or nano for Linux. Most will use either ASCII or a Unicode encoding. When a text editor is reading an ASCII file it will grab a byte, say 01100001 (97 in decimal), and find the corresponding glyph, 'a'.
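
The byte-to-glyph lookup described above can be sketched in a few lines of Python. This is only an illustration; the variable names byte_value and glyph are made up for the example.

```python
# The bit pattern from the paragraph above, written as a binary literal.
byte_value = 0b01100001
print(byte_value)  # 97

# Build a one-byte string from that value and decode it as ASCII,
# which is roughly what a text editor does when it reads the file.
glyph = bytes(bytearray([byte_value])).decode('ascii')
print(glyph)  # a
```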

So when a text editor tries to read a JPG it will grab the same byte 01100001 and get 'a', but since the file holds information for displaying a photo, the text will just be gibberish. Try opening the JPEG in Notepad or nano.

As for encoding, here is an explanation: What is the difference between encode/decode?
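
To make the mismatch concrete, here is a small sketch that assumes the file begins with the four bytes FF D8 FF E0, the usual start of a JFIF-style JPEG. The real bytes of any given image may differ, and the names jpeg_header and ascii_ok are illustrative only.

```python
# Assumed first four bytes of a JFIF JPEG (start-of-image + APP0 markers).
jpeg_header = b'\xff\xd8\xff\xe0'

# ASCII only covers byte values 0-127, so decoding these bytes fails.
try:
    jpeg_header.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(ascii_ok)  # False

# Every one of the four header bytes is outside the ASCII range.
print(all(b >= 128 for b in bytearray(jpeg_header)))  # True
```

Four leading non-ASCII bytes like these would be consistent with the "position 0-3" reported in the question's traceback.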

cameron-f