
Hope you can help me.

I have a folder where there are several .xlsx files with similar structure (NOTE that some of the files might be bigger than 50MB). I want to combine them all together and (eventually) send them to a database. But before that, I need to improve the performance of this block of code because sometimes it takes a lot of time to process all those files.

The code in question is this:

import pandas as pd

# `location` is an iterable of paths to the .xlsx files
df_list = []
for file in location:
    df_list.append(pd.read_excel(file, header=0, engine='openpyxl'))

df_concat = pd.concat(df_list)

Any suggestions?

Somewhere I read that converting Excel files to CSV might improve the performance, but should I do that before appending the files or after everything is concatenated? And considering df_list is a list, can I do that conversion?

Catarina Ribeiro

2 Answers


Reading Excel files in pandas is quite slow, as you stated; you should have a look at this answer. It basically uses a VBScript, run before the Python script, to convert the Excel files to CSV, which is much faster for the Python script to read.

To be more specific, and to answer the second part of your question: you should convert the Excel files to CSV before loading them with pandas. The read_excel function is the slow part.
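As a sketch of that convert-first workflow (paths and function names here are illustrative, not from your code; this assumes the third-party `pandas` and `xlsx2csv` packages, the latter of which exposes an `Xlsx2csv(...).convert(...)` API):

```python
from pathlib import Path


def csv_target(xlsx_path, csv_dir):
    """Illustrative helper: map e.g. ./data/foo.xlsx to csv_dir/foo.csv."""
    return Path(csv_dir) / (Path(xlsx_path).stem + ".csv")


def convert_and_load(xlsx_dir, csv_dir):
    """Convert every .xlsx in xlsx_dir to CSV, then read the CSVs with pandas.

    Requires the third-party packages `pandas` and `xlsx2csv`.
    """
    import pandas as pd
    from xlsx2csv import Xlsx2csv  # third-party converter

    Path(csv_dir).mkdir(parents=True, exist_ok=True)
    frames = []
    for xlsx in Path(xlsx_dir).glob("*.xlsx"):
        dest = csv_target(xlsx, csv_dir)
        # Convert once to CSV, then read the CSV, which is much
        # faster than calling pd.read_excel on each workbook
        Xlsx2csv(str(xlsx), outputencoding="utf-8").convert(str(dest))
        frames.append(pd.read_csv(dest))
    return pd.concat(frames, ignore_index=True)
```

The conversion step is the one-off cost; every later re-run can read the cached CSVs directly.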

robinood

I've found a solution with xlsx2csv:

import glob
import re
import subprocess

import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')


for xlsx in list_of_xlsx:
    # Extract the file name (group 2 of the pattern, "(.+)")
    filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
    # Set up the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # On Windows use shell=True
    except OSError:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # specify filepath+filename of output csv

listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # make sure there are 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)

I tried to make this work for me but had no success. Maybe you can get it working for you? Or does anyone have any ideas?

newUser
    I tried this and it gave me this error: `self.ziphandle.close() AttributeError: 'Xlsx2csv' object has no attribute 'ziphandle'` – Catarina Ribeiro Apr 14 '22 at 16:55
    I read more about this xlsx2csv library and I came up with a solution that converts xlsx directly to CSV without having to open the file. `Xlsx2csv(file, outputencoding="utf-8").convert(destFile)` It's as simple as this. – Catarina Ribeiro Apr 22 '22 at 14:19