0

i'm using postman to send an excel file which i am reading in tornado.


Tornado code

self.request.files['1'][0]['body'].decode()

here if i send .csv than, the above code works.


if i send .xlsx file than i am stuck with this error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte


request.files will fetch the file but the type would be byte. so to convert byte to str i've used decode(), which works only for .csv and not for .xlsx

i tried decode('utf-8') but still no luck.

i've tried searching but didn't find any issue mentioning 0x87 problem?

Community
  • 1
  • 1

3 Answers3

1

The reason is that the .xlsx file has a different encoding, not utf-8. You'll need to use the original encoding to decode the file.

There's no guaranteed way of finding out the encoding of a file programmatically. I'm guessing you're making this application for general users and so you will keep encountering files with different and unexpected encodings.

A good way to deal with this is by trying to decode using multiple encodings, in case one fails. Example:

encodings = ['utf-8', 'iso-8859-1', 'windows-1251', 'windows-1252']

for encoding in encodings:
    try:
        decoded_file = self.request.files['1'][0]['body'].decode(encoding)
    except UnicodeDecodeError:
        # this will run when the current encoding fails
        # just ignore the error and try the next one
        pass
    else:
        # this will run when an encoding passes
        # break the loop
        # it is also a good idea to re-encode the 
        # decoded files to utf-8 for your purpose
        decoded_file = decoded_file.encode("utf8")
        break
else:
    # this will run when the for loop ends
    # without successfully decoding the file
    # now you can return an error message
    # to the user asking them to change 
    # the file encoding and re upload
    self.write("Error: Unidentified file encoding. Re-upload with UTF-8 encoding")
    return

# when the program reaches here, it means 
# you have successfully decoded the file 
# and you can access it from `decoded_file` variable

Here's a list of some common encodings: What is the most common encoding of each language?

xyres
  • 20,487
  • 3
  • 56
  • 85
  • Thanks for the idea of multiple encoding. this .xlsx decodes by 'iso-8859-1' but when i encode by utf-8 the output is just some random bytes which i don't understand. I've tried different encoding but still no luck. can you please try decoding this file? [test.xlsx](https://drive.google.com/open?id=10GDlEFpZmOSHAh-kMVEKmL2TZl4IBmkM) – Mukesh Suthar Jun 06 '18 at 11:15
  • 1
    @MukeshSuthar You're reading the data as plain string (or bytes). You need to treat the data as as xlsx format. But that would be a lot of code to write. Use a library called [openpyxl](https://openpyxl.readthedocs.io/en/stable/index.html) which will help you read the data as xlsx. – xyres Jun 06 '18 at 12:57
  • @MukeshSuthar One more thing, if you want to read the data directly from memory without saving to disk first, [see this answer](https://stackoverflow.com/a/20667890/1925257). – xyres Jun 06 '18 at 12:59
  • Is conversion to CSV an Option (maybe using a python-triggered VBA binary)? – sudonym Jun 06 '18 at 13:01
  • 1
    @xyres EQObject = self.request.files['0'][0]['body'].decode('iso-8859-1').encode('utf-8') openpyxl.load_workbook(filename=BytesIO(EQObject)). Error: zipfile.BadZipFile: Bad magic number for central directory. I dont understand zip error, i'm getting input as byte which is xlsx, and how this zip guy in between poping ambigous errors. – Mukesh Suthar Jun 06 '18 at 14:28
  • @mukesh sughar You will find some code to copy and paste on how to read xslx with python in one of my questions https://stackoverflow.com/q/49870991/6060083 – sudonym Jun 06 '18 at 14:44
  • @MukeshSuthar What is `EQObject`? Please open a new question because I don't think this `BadZipFile` error is related to your original question. – xyres Jun 06 '18 at 17:23
  • @xyres EQObject is just a variable name to store the xlsx (in byte). and for BadZipFile error ill open a new question. – Mukesh Suthar Jun 06 '18 at 17:32
1

I faced the same issue and this worked for me.

    import io
    
    df = pd.read_excel(io.BytesIO(self.request.files['1'][0]['body']))
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Marinario Agalliu Oct 14 '21 at 09:46
0

try this one, following suggestions provided here:

self.request.files['1'][0]['body'].decode('iso-8859-1').encode('utf-8')
sudonym
  • 3,788
  • 4
  • 36
  • 61
  • AttributeError: 'bytes' object has no attribute 'str' – Mukesh Suthar Jun 06 '18 at 07:12
  • try it after removing the first "str" - I have updated my answer accordingly. if this doesn't work, try to remove the second "str" while keeping the first one – sudonym Jun 06 '18 at 07:18
  • still same error AttributeError: 'bytes' object has no attribute 'str' can you tell me why do we need to encode it after decoding? – Mukesh Suthar Jun 06 '18 at 08:07
  • just remove all the 'str' - decode/encode is a common conversion flow – sudonym Jun 06 '18 at 08:24
  • removed all 'str', the code works. but the output is just some random byte formatted text. output: \x00\x11\x00\x07\x04\x00\x00\x18K\x00\x00\x00\x00' ... goes-on.. – Mukesh Suthar Jun 06 '18 at 10:54
  • Ok, so we are one step closer. Let me see what I can do, please give me some short time – sudonym Jun 06 '18 at 12:59