I'm trying to read a CSV file from an Amazon S3 bucket like this:

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='BUCKET_NAME', Key='FILE_NAME')
data = obj['Body'].read().decode('utf-8').splitlines()

But this is what I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 10: invalid start byte

I thought I had the file's encoding wrong, so I ran `file FILE_NAME.csv`, which reported UTF-8. I have also tried latin-1 and some other encodings, but all of them give me gibberish. Any help is appreciated, and thank you all!

  • Try `obj = s3.get_object(Bucket='BUCKET_NAME', Key='FILE_NAME', ResponseContentType='text/csv')` - does that work? – Ermiya Eskandary Oct 11 '21 at 17:49
  • @ErmiyaEskandary Unfortunately, it's not working. This is the error I'm getting: ```UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 10-11: invalid continuation byte``` – Nava Braulio Oct 11 '21 at 17:59
  • Is this a valid CSV file? Does it open correctly if you download it via the AWS console? – Ermiya Eskandary Oct 11 '21 at 18:03
  • @ErmiyaEskandary It's a valid CSV file. I can download it using AWSCLI and the console – Nava Braulio Oct 11 '21 at 18:06
  • And it opens up as a CSV file? – Ermiya Eskandary Oct 11 '21 at 18:07
  • @ErmiyaEskandary Yes and all the data is in there – Nava Braulio Oct 11 '21 at 18:08
  • Hmm - can you please split out the `data = obj['Body'].read().decode('utf-8').splitlines()` command and let us know the contents in `obj['Body']`, `obj['Body'].read()` and `obj['Body'].read().decode('utf-8')` separately? I feel like the `obj['Body']` might perhaps be empty? Not sure. – Ermiya Eskandary Oct 11 '21 at 18:10
  • Ah wait - what's the compression set for the CSV file in the S3 console? – Ermiya Eskandary Oct 11 '21 at 18:13
  • Also try `data = obj['Body'].iter_lines()` – Ermiya Eskandary Oct 11 '21 at 18:15
  • @ErmiyaEskandary The output of ```obj['body']```: `````` output of ```obj['body'].read()```: ```\xc7)D\x94\xcc:9\x01\xb4\x1f\xa9\x01\x93G\xb6\x84Dv\xdf\x9b\xd8\xd0uHp<\xfe\xfb\xee\xb4\xea\x83\x1\x02\x14\x00\x14\x00\x08\x08\x08\x00\xf4\x8eKS\x15+\xf8\xeew\x02\x00\x00\x``` (this is part of the output) output of ```obj['body'].read().decode('utf-8')```: ```UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 10-11: invalid continuation byte``` – Nava Braulio Oct 11 '21 at 18:17
  • @ErmiyaEskandary I'm not sure how to find out what compression is set for the file – Nava Braulio Oct 11 '21 at 18:19
  • Does `data = obj['Body'].iter_lines()` work? - also please add the outputs to the question if you don't mind and the fact that it is indeed a valid CSV etc. as comments get lost thank you – Ermiya Eskandary Oct 11 '21 at 18:47
  • @ErmiyaEskandary Will do! This is the output for ```data = obj['body'].iter_lines()```: `````` Thank you! – Nava Braulio Oct 11 '21 at 18:55
  • Are you trying to save the file or loop over the file contents? If looping, that should work - try `for line in iterator:` `print(line)` and that will output the CSV for you – Ermiya Eskandary Oct 11 '21 at 18:57
  • @ErmiyaEskandary Either way works for me. I tried looping over the content and I just got this: ```b'hZ0' b'KRh\xc1\xb4\x10:\x1f\xe6\xb7\xc3!\xc6\xc5\xc4\x03\xc2fz>1\xc0\x9f\xa9T\x06\xe5\xed\'\xd9\xb4\x9c\xb1\xcbv\x87\x00\x08\x9bi\x91-\xac \xda\x14\xd9\x02\xfcF\xad\'\x86\x1d]-z\x06\x84\x84\xcc\x8f\xe1\xc2\xf7\xbb\x13\xe4"\xbf\xf5m\xc7\xd7UT\x0b\xb8S\xb4\x06U\x01\x12wi\xfd($\x08\xa27\x04P\x94\x18,\xe5\xb6\x1c\x9a\x93\xdc\x16\x94\x7f\x9b\xa4\xa2\xf8$\xc7\xe2\xc4`\x813S\xb7' b"?`Y\x1e'Q\x1c\xfaq\xca?\xea\xfa\xe9"``` (This is a part of the output) – Nava Braulio Oct 11 '21 at 19:08
  • Does this work? https://stackoverflow.com/a/48592700/4800344 – Ermiya Eskandary Oct 11 '21 at 19:27

1 Answer

This happens when the data contains non-ASCII bytes that are not valid UTF-8, which can occur when the text was written in a different encoding. In that case, try decoding it as windows-1252.

The solution below works fine for me. It tries UTF-8 first, since that is the recommended encoding, and falls back to windows-1252 if decoding raises an error.

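def decode_message(message: bytes) -> str:
    # message must be bytes, e.g. the result of obj['Body'].read()
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to windows-1252
        return message.decode('windows-1252')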

Once decoded, you can encode the string back to UTF-8 if needed.
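As a rough illustration of the fallback and the re-encoding step, here is a minimal sketch. The function name `decode_with_fallback`, its `encodings` parameter, and the sample bytes are made up for this example and are not from the question.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "windows-1252")) -> str:
    """Decode raw with the first encoding in the list that succeeds."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # Remember the failure and try the next candidate encoding.
            last_error = exc
    # Every candidate failed, so re-raise the most recent error.
    raise last_error


# Hypothetical sample: 0x93 and 0x94 are curly quotes in windows-1252,
# but neither is a valid start byte in UTF-8.
raw = b"price,\x93note\x94"
text = decode_with_fallback(raw)

# Once the text is a str, it can be written back out as UTF-8.
utf8_bytes = text.encode("utf-8")
```

Two caveats apply. Python's windows-1252 codec leaves a few byte values undefined (for example 0x81), so the fallback can still raise `UnicodeDecodeError`. And a decode that succeeds does not prove the bytes were text in that encoding: compressed or otherwise binary data will simply come out as gibberish.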

Reyan Chougle