
I have a requirement to create an Excel workbook with multiple tabs. I could do it with pandas and openpyxl:

import math
import pandas as pd

def write_to_excel(file, sheet, dfM, mode='a'):
    # Remove duplicate columns from the dataframe, if any.
    dfM = dfM.loc[:, ~dfM.columns.duplicated()]
    totRows = len(dfM.index)
    totCols = len(dfM.columns)

    # Split into multiple sheets of at most 100K rows each
    rows_per_sheet = 100000
    number_of_sheets = max(1, math.ceil(totRows / rows_per_sheet))
    start_index = 0
    end_index = rows_per_sheet

    for sheet_num in range(number_of_sheets):
        df = dfM.iloc[start_index:end_index]          # split on row index
        sheet_nme = sheet + '_' + str(sheet_num + 1)  # append the sheet number to the sheet name

        # Create a pandas ExcelWriter using openpyxl as the engine.
        if sheet_num > 0:
            mode = 'a'  # only the first sheet ever needs mode='w'
        writer = pd.ExcelWriter(file, engine='openpyxl', mode=mode, optimized_write=True, write_only=True)

        # Write this chunk of the dataframe to its own sheet.
        df.to_excel(writer, sheet_name=sheet_nme, index=False)

        # Get the openpyxl worksheet object for the formatting loops below.
        worksheet = writer.sheets[sheet_nme]

        # Formatting loops 1 to 4 (iterate over the columns and rows; code elided here):
        #   - background coloring & bolding of the header row for some columns
        #   - formatting of every column and background coloring of some particular columns
        #   - font & background coloring of the last summary row

        # Finally write the file
        writer.close()  # close() also saves the workbook

        start_index = end_index
        end_index = end_index + rows_per_sheet
        del df  # release the chunk before the next iteration

    # Release dataframes that are no longer needed
    del dfM  # df_format from the elided formatting code would be released here as well
    
    return True
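
For context, I call this once per tab, so the workbook is built up one sheet at a time (the dataframe names here are just illustrative):

# Illustrative only: df_detail / df_summary stand in for my real ~275-column frames
write_to_excel('report.xlsx', 'Detail', df_detail, mode='w')   # first call creates the file
write_to_excel('report.xlsx', 'Summary', df_summary)           # later calls append new tabs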

The problem is that I am dealing with very large files, on the order of a million rows, and the script crashes due to memory issues.

No single tab will exceed 100K rows; I write each tab in one go, close the writer, and then start the next tab in the same workbook.

So I am looking for a faster and more memory-efficient way to do this. I am thinking of converting this to pyexcelerate to speed it up, but I am still not sure what to do about the memory issue.

Any thoughts?

I tried writing separate sheets and combining them into one workbook at the end, but I could not find a Python library that runs on Linux and supports copying sheets between multiple Excel workbooks.

I may have around 500K rows and some 275 columns in the workbook; the columns include both numeric and text data. It works fine up to 100K rows or so, but gets stuck after that.

I am trying pyexcelerate right now to see whether it will work.
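
Roughly what I have in mind with pyexcelerate, as an untested sketch (it takes plain 2-D lists rather than dataframes, so the header row has to be prepended by hand; function and variable names are just illustrative):

import math
import pandas as pd
from pyexcelerate import Workbook

def write_with_pyexcelerate(dfM, file, sheet, rows_per_sheet=100000):
    # Build the whole workbook in memory and save it once at the end.
    wb = Workbook()
    number_of_sheets = max(1, math.ceil(len(dfM) / rows_per_sheet))
    for sheet_num in range(number_of_sheets):
        chunk = dfM.iloc[sheet_num * rows_per_sheet:(sheet_num + 1) * rows_per_sheet]
        # pyexcelerate wants plain rows, so prepend the header and convert the values
        data = [list(chunk.columns)] + chunk.values.tolist()
        wb.new_sheet(sheet + '_' + str(sheet_num + 1), data=data)
    wb.save(file)

The downside I can see is that values.tolist() materialises every chunk as Python lists, so this trades openpyxl's overhead for a different kind of memory use, and the cell formatting from loops 1 to 4 would have to be redone with whatever styling pyexcelerate supports.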

Johnson Francis
  • Use openpyxl or xlsxwriter directly in their optimised modes. – Charlie Clark Sep 02 '22 at 13:38
  • If you need a million rows for the Excel spreadsheet, it's time for a database + a visualization tool like PowerBI. Even if you can create it, the users working with the Excel file on a shared drive are not going to have a good experience. – Code Different Sep 02 '22 at 13:42
  • @CodeDifferent: I agree with your point, but the stakeholders are the boss, so I need to do something to improve it. Currently it crashes even for 400K rows. – Johnson Francis Sep 02 '22 at 13:55
  • @CharlieClark: Tried adding optimized_write=True, write_only=True. It improved the speed a little bit, but still not much luck on memory. – Johnson Francis Sep 03 '22 at 01:12
  • I have some time on the weekend and will take a stab at this. Can you tell me some info about your data frame: number of rows and columns, data types for each column, the kind of formatting you are doing to the output, etc.? – Code Different Sep 03 '22 at 11:45
  • You ignored the point about working with the libraries directly. – Charlie Clark Sep 05 '22 at 11:00
  • @CodeDifferent: Thanks, I have updated the question with more details. – Johnson Francis Sep 08 '22 at 09:46
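
For reference, my understanding of the suggestion above to use openpyxl directly in its optimised (write-only) mode is something like the sketch below. It is untested against my data, the per-column and summary-row formatting is omitted, and the fill color is just illustrative; note that a write-only workbook has to be written in a single pass, so it cannot append to an existing file the way mode='a' does.

import pandas as pd
from openpyxl import Workbook
from openpyxl.cell import WriteOnlyCell
from openpyxl.styles import Font, PatternFill

def write_df_write_only(dfM, file, sheet, rows_per_sheet=100000):
    wb = Workbook(write_only=True)   # streaming workbook, rows are flushed as they are appended
    header_font = Font(bold=True)
    header_fill = PatternFill(fill_type='solid', start_color='FFFFCC00')  # illustrative color

    for sheet_num, start in enumerate(range(0, max(1, len(dfM)), rows_per_sheet), start=1):
        ws = wb.create_sheet(title=sheet + '_' + str(sheet_num))

        # In write-only mode cells cannot be revisited, so header styling
        # has to be applied when the row is appended.
        header = []
        for name in dfM.columns:
            cell = WriteOnlyCell(ws, value=name)
            cell.font = header_font
            cell.fill = header_fill
            header.append(cell)
        ws.append(header)

        # Stream the data rows for this chunk without building styled cell objects.
        chunk = dfM.iloc[start:start + rows_per_sheet]
        for row in chunk.itertuples(index=False, name=None):
            ws.append(row)

    wb.save(file)   # all sheets are written in one pass

xlsxwriter has a similar streaming option ({'constant_memory': True}), but I started from openpyxl since that is what the pandas code above already uses.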

0 Answers