When a file is uploaded to S3, only the binary representation of the file is stored; S3 has no idea what it contains. If the file was uploaded with a charset parameter in the Content-Type header, looking something like text/plain; charset=utf-8, then you can assume the contents of the object were encoded with the specified encoding. However, this header is advisory only: some clients will ignore it and make their own assumptions, and S3 does not validate the declared encoding against the actual bytes, so it could simply be wrong.
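As a sketch of how you might read that header back: the stored Content-Type is returned by a HEAD request on the object, and the charset parameter can be parsed out with the standard library. The bucket and key names below are placeholders, and the boto3 part is shown only in comments since it needs real credentials.

```python
from email.message import EmailMessage

def charset_from_content_type(content_type):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = EmailMessage()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()  # lowercased, e.g. 'utf-8'

# With boto3 (bucket/key are hypothetical), head_object returns the
# stored Content-Type without downloading the object body:
#
#   import boto3
#   s3 = boto3.client('s3')
#   head = s3.head_object(Bucket='my-bucket', Key='some/object.txt')
#   charset = charset_from_content_type(head['ContentType'])

print(charset_from_content_type('text/plain; charset=utf-8'))  # utf-8
print(charset_from_content_type('text/plain'))                 # None
```

If the function returns None, you are in the "no guarantees" situation described below.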
If that header is not present, there are no guarantees about the content encoding, and unless you know the encoding through some other means, you pretty much have to guess.
How you guess depends on your exact situation. It is common to use a charset detection algorithm, such as Mozilla's universal charset detector used by Firefox, or Google's Compact Encoding Detection (compact_enc_det) used by Chromium. In Python, one option is chardet, a port of Mozilla's detector. Alternatively, some solutions simply assume the content is UTF-8 and fail on decoding errors. Which approach fits depends on how permissive you want to be and how likely you are to encounter varied encodings in practice.
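The "assume UTF-8, fail on errors" strategy can be softened into a fallback chain, which is one common middle ground. A minimal stdlib-only sketch (the candidate list here is an assumption; pick encodings that match your data sources):

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Try each candidate encoding in turn; return (text, encoding_used).

    latin-1 maps every byte value to a code point, so with it last the
    loop always succeeds -- possibly producing mojibake if the real
    encoding was something else entirely.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')

# A detection library guesses from byte statistics instead, e.g. with
# the third-party chardet package (pip install chardet):
#
#   import chardet
#   guess = chardet.detect(data)  # {'encoding': ..., 'confidence': ...}

print(decode_with_fallback('héllo'.encode('utf-8')))  # ('héllo', 'utf-8')
print(decode_with_fallback(b'caf\xe9'))               # ('café', 'cp1252')
```

Note the trade-off: the fallback chain never fails outright but can silently return garbage, while a strict UTF-8-only decode fails loudly, which is often preferable in a pipeline.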