When a file is uploaded to S3, only the binary representation of the file is stored; S3 has no idea what it contains. If the file was uploaded with a charset parameter in the Content-Type header, looking something like text/plain; charset=utf-8, then you can assume the contents of the object were encoded with the specified encoding. However, this header is advisory only: some clients will ignore it and make their own assumptions, and S3 does not validate the declared encoding against the actual bytes, so it could simply be wrong.
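As a sketch of how you might read that header back: the stored Content-Type is returned by a HEAD request on the object, and the charset parameter can be parsed out with the standard library. The bucket and key names below are placeholders, and the boto3 part is shown only in comments since it needs real credentials.

```python
from email.message import EmailMessage

def charset_from_content_type(content_type):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = EmailMessage()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()  # lowercased, e.g. 'utf-8'

# With boto3 (bucket/key are hypothetical), head_object returns the
# stored Content-Type without downloading the object body:
#
#   import boto3
#   s3 = boto3.client('s3')
#   head = s3.head_object(Bucket='my-bucket', Key='some/object.txt')
#   charset = charset_from_content_type(head['ContentType'])

print(charset_from_content_type('text/plain; charset=utf-8'))  # utf-8
print(charset_from_content_type('text/plain'))                 # None
```

If the function returns None, you are in the "no guarantees" situation described below.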
If that header is not present, there are no guarantees about the content encoding, and unless you know the encoding through some other means, you pretty much have to guess.
How you guess depends on your exact situation. It is common to use a charset detection algorithm, such as Mozilla's universal charset detector used by Firefox, or Google's Compact Encoding Detection (compact_enc_det) used by Chromium. In Python, one option is chardet, a port of Mozilla's detector. Alternatively, some solutions simply assume the content is UTF-8 and fail on decoding errors. Which approach fits depends on how permissive you want to be and how likely you are to encounter varied encodings in practice.
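The "assume UTF-8, fail on errors" strategy can be softened into a fallback chain, which is one common middle ground. A minimal stdlib-only sketch (the candidate list here is an assumption; pick encodings that match your data sources):

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Try each candidate encoding in turn; return (text, encoding_used).

    latin-1 maps every byte value to a code point, so with it last the
    loop always succeeds -- possibly producing mojibake if the real
    encoding was something else entirely.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')

# A detection library guesses from byte statistics instead, e.g. with
# the third-party chardet package (pip install chardet):
#
#   import chardet
#   guess = chardet.detect(data)  # {'encoding': ..., 'confidence': ...}

print(decode_with_fallback('héllo'.encode('utf-8')))  # ('héllo', 'utf-8')
print(decode_with_fallback(b'caf\xe9'))               # ('café', 'cp1252')
```

Note the trade-off: the fallback chain never fails outright but can silently return garbage, while a strict UTF-8-only decode fails loudly, which is often preferable in a pipeline.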