PANDAS & glob - Excel file format cannot be determined, you must specify an engine manually

Question

I am not sure why I am getting this error although sometimes my code works fine!

Excel file format cannot be determined, you must specify an engine manually.

Here below is my code with steps:

1- list of columns of customers Id:

customer_id = ["ID","customer_id","consumer_number","cus_id","client_ID"]

2- The code to find all xlsx files in a folder and read them:

l = [] #use a list and concat later, faster than append in the loop
for f in glob.glob("./*.xlsx"):
    df = pd.read_excel(f).reindex(columns=customer_id).dropna(how='all', axis=1)
    df.columns = ["ID"] # to have only one column once concat
    l.append(df)
all_data  = pd.concat(l, ignore_index=True) # concat all data

I added the engine openpyxl

df = pd.read_excel(f, engine="openpyxl").reindex(columns = customer_id).dropna(how='all', axis=1)

Now I got a different error:

BadZipFile: File is not a zip file

pandas version: 1.3.0 python version: python3.9 os: MacOS

is there a better way to read all xlsx files from a folder ?

I've met excatly same problem on linux and the reason is just as @pirateofebay said. There are some temporary files which are hidden. I've solved this by using pd.ExcelWriter to open all files related and then use writer.close() to close them one by one. Then we are good, you can use pd.read_excel as you wish. :D — Christy, Jul 27 '22 at 06:55

score 33 · Accepted Answer · answered Jul 22 '21 at 01:47

Found it. When an excel file is opened for example by MS excel a hidden temporary file is created in the same directory:

~$datasheet.xlsx

So, when I run the code to read all the files from the folder it gives me the error:

Excel file format cannot be determined, you must specify an engine manually.

When all files are closed and no hidden temporary files ~$filename.xlsx in the same directory the code works perfectly.

score 28 · Answer 2 · answered Sep 26 '21 at 21:15

Also make sure you're using the correct pd.read_* method. I ran into this error when attempting to open a .csv file with read_excel() instead of read_csv(). I found this handy snippet here to automatically select the correct method by Excel file type.

if file_extension == 'xlsx':
    df = pd.read_excel(file.read(), engine='openpyxl')
elif file_extension == 'xls':
    df = pd.read_excel(file.read())
elif file_extension == 'csv':
    df = pd.read_csv(file.read())

score 3 · Answer 3 · answered May 09 '22 at 16:56

3

You can filter out the unwanted temp files by checking if file starts with "~".

import os

for file in os.listdir(folder path):
    if not file.startswith("~") and file.endswith(".xlsx"):
        print(file)

answered May 09 '22 at 16:56

MinneapolisCoder9

601
1
11
29

score 2 · Answer 4 · answered Nov 24 '21 at 21:28

2

https://stackoverflow.com/a/32241271/17411729

link to an answer on how to remove hidden files

Mac = go to folder press cmd + shift + . will show the hidden file, delete it, run it back.

answered Nov 24 '21 at 21:28

TimTiger

21
3

Thank you for pointing out a potentially duplicated question. However, there are two things that you may consider: 1) to mention it only as a comment to the question, rather than an answer 2) If the solution in the SO page that you referred is not exactly the same, you should include the steps that you took too, not only the link – Aryo Pradipta Gema Nov 25 '21 at 08:34
Thanks for making me aware, I will try to keep that in mind the next time :) would love to move my answer to the comments, unfortunately, I'm not allowed to make comments until I get 50 reps. – TimTiger Dec 07 '21 at 18:13

score 2 · Answer 5 · edited Feb 05 '22 at 13:47

In macOS, an "invisible file" named ".DS_Store" is automatically generated in each folder. For me, this was the source of the issue. I solved the problem with an if statement to bypass the "invisible file" (which is not an xlsx, so thus would trigger the error)

for file in os.scandir(test_folder):
    filename = os.fsdecode(file)
    if '.DS_Store' not in filename:
        execute_function(file)

score 2 · Answer 6 · answered Mar 18 '22 at 09:25

2

In my case, I usedxlrd. So in terminal:

pip install xlrd

If pandas is not installed, install it:

pip install pandas

Now read the excel file this way:

import pandas as pd

df = pd.read_excel("filesFolder/excelFile.xls", engine='xlrd')

answered Mar 18 '22 at 09:25

Zahra

2,231
3
21
41

score 1 · Answer 7 · answered Jul 28 '21 at 03:40

1

I also got an 'Excel file format...' error when I manually changed the 'CSV' suffix to 'XLS'. All I had to do was open excel and save it to the format I wanted.

answered Jul 28 '21 at 03:40

fang aa

11
1

score 1 · Answer 8 · edited Nov 16 '21 at 22:49

1

Looks like an easy fix for this one. Go to your excel file, whether it is xls or xlsx or any other extension, and do "save as" from file icon. When prompted with options. Save it as CSV UTF-8(Comma delimited)(*.csv)

edited Nov 16 '21 at 22:49

Héctor Valverde

1,089
1
14
34

answered Nov 15 '21 at 15:44

Mo KC

31
3

PANDAS & glob - Excel file format cannot be determined, you must specify an engine manually

8 Answers8

Linked