Pandas can not read S3 excel file. Error: Excel file format cannot be determined

Question

I am reading excel files (.xls) from S3 with Pandas. The code works properly for a few files, but for the rest not. The files are received daily with different values per day (excel files structure is the same, so we can consider the files identical).

The error is:

ValueError: Excel file format cannot be determined, you must specify an engine manually.

at this line:

pd.read_excel(io.BytesIO(excel), sheet_name=sheet, index_col=None, header=[0])

I have tried all the solutions mentioned on internet: specifying the engine='openpyxl' gives the error:

zipfile.BadZipFile: File is not a zip file

and specifying the engine='xlrd' gives the error:

expected str, bytes or os.PathLike object, not NoneType

I am using boto3 to connect to S3 resource. Once again, for a few files my code works fine. What can be the cause of this different behaviour for excel files that looks identical?

My question is very similar with Excel file format cannot be determined with Pandas, randomly happening but it doesn't have a proper response yet.

score 0 · Answer 1 · answered Dec 18 '21 at 17:41

0

It's always possible that the files you are reading have mislabeled extensions, bad data, etc.

It's also not clear how you are getting 'excel' in io.BytesIO(excel)

See if something like this will work. this is reading a .xls file. I'm able to return contents of Sheet1 to a dataframe.

bucket = 'your bucket'
key = 'test.xls'
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=key)
pd.read_excel(obj['Body'].read(), sheet_name='Sheet1', index_col=None, header=0)

answered Dec 18 '21 at 17:41

Jonathan Leon

5,440
2
6
14

Thanks for your answer, Jonathan! I am using the same code to read the excel as yours. And it is working also for me for a few files, but not for all of them. But yes, your asumption about something wrong with the .xls file is the most probable. Maybe bad content or wrong file metadata. But I don't know how to determine what's exactly wrong with those .xls files. – Trusca F Dec 18 '21 at 18:07
because i'm not using io.Bytes(IO), curious if my code gives the same errors. wasn't clear if you tried mine on all your files. – Jonathan Leon Dec 18 '21 at 20:52
Yes, I have tried your code on all of the files and the result is the same. For some files it's working, but for others not. So I have to find a way to determine what and when something went wrong with those .xls files. Most probabbly this will be a very "foggy" process :( Thanks again for your help, Jonathan! – Trusca F Dec 20 '21 at 09:02

Pandas can not read S3 excel file. Error: Excel file format cannot be determined

1 Answers1