
I need to log important information in a human readable format for an app.

The three biggest requirements are as follows:

  1. The info should be logged in a single CSV/excel file.
  2. A human should be able to read the CSV/excel file efficiently, which implies that important rows should be colored.
  3. It should be fast and memory efficient.

I have found many ways of satisfying requirements 1, 2, and 3 individually, but never all three at the same time.

For point 1, I can use "Append existing excel sheet with new dataframe using python pandas" or "How to add pandas data to an existing csv file?".
For point 2, I can use "Color formatting excel file row in python".

If I want to accomplish points 1 and 2 at the same time, I can use a slightly modified version of "Append existing excel sheet with new dataframe using python pandas" (by calling df.style.apply right before the to_excel() call). But since the Excel file is re-read at each iteration, the result is slow.
Similarly, "How to add pandas data to an existing csv file?" satisfies points 1 and 3, but not point 2: colors do not seem to work with to_csv() (at least not when saving to a file). to_excel(), on the other hand, colors the rows correctly but doesn't accept the mode='a' argument that gives to_csv() its efficiency.
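For reference, the plain-CSV append that covers points 1 and 3 (but not 2) is essentially a one-liner; the filename is just a placeholder:

```python
import os

import pandas as pd

df = pd.DataFrame({'A': ['2', '3', '4'],
                   'B': ['5', '6', '7']})

# Append to the CSV, writing the header only if the file doesn't exist yet.
# mode='a' means the cost is proportional to the new rows, not the whole file.
df.to_csv("out.csv", mode='a',
          header=not os.path.isfile("out.csv"), index=False)
```

This stays fast no matter how large the log grows, but a CSV file cannot carry any cell styling, so the important rows can't be colored.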

Here is a minimal reproducible example corresponding to points 1 and 2 at the same time. As you will see if you run the code, the time taken per iteration quickly increases.

from os.path import isfile

import pandas as pd
from openpyxl import load_workbook
from time import time


def highlight_idx(x, idx):
    if x.name == idx:
        return ['background-color: red'] * len(x)
    return [''] * len(x)


def append_df_to_excel(filename, df, sheet_name='Sheet1', selected_idx=None, startrow=None,
                       truncate_sheet=False,
                       **to_excel_kwargs):
    # Excel file doesn't exist - saving and exiting
    if not isfile(filename):
        df.style.apply(lambda x: highlight_idx(x, selected_idx), axis=1).to_excel(
            filename,
            sheet_name=sheet_name,
            startrow=startrow if startrow is not None else 0,
            **to_excel_kwargs)
        return

    # ignore [engine] parameter if it was passed
    if 'engine' in to_excel_kwargs:
        to_excel_kwargs.pop('engine')

    # open the existing workbook for appending; with older pandas versions
    # writer.book must be set explicitly for the startrow detection below to work
    writer = pd.ExcelWriter(filename, engine='openpyxl', mode='a')
    writer.book = load_workbook(filename)

    # get the last row in the existing Excel sheet
    # if it was not specified explicitly
    if startrow is None and sheet_name in writer.book.sheetnames:
        startrow = writer.book[sheet_name].max_row

    # truncate sheet
    if truncate_sheet and sheet_name in writer.book.sheetnames:
        # index of [sheet_name] sheet
        idx = writer.book.sheetnames.index(sheet_name)
        # remove [sheet_name]
        writer.book.remove(writer.book.worksheets[idx])
        # create an empty sheet [sheet_name] using old index
        writer.book.create_sheet(sheet_name, idx)

    # copy existing sheets
    writer.sheets = {ws.title: ws for ws in writer.book.worksheets}

    if startrow is None:
        startrow = 0

    # write out the new sheet
    df.style.apply(lambda x: highlight_idx(x, selected_idx), axis=1).to_excel(
        writer, sheet_name, startrow=startrow, **to_excel_kwargs)

    # save the workbook
    writer.save()


for i in range(100000):
    df = pd.DataFrame({'A': ['2', '3', '4'],
                       'B': ['5', '6', '7']})
    start = time()
    append_df_to_excel("out.xlsx", df, selected_idx=1)
    print("time taken for iteration:", time() - start)

I have tried many other ways of doing things. But as I said earlier, I haven't been able to satisfy all 3 points so far.

Skum

  • Try doing all the dataframe manipulations first and save to Excel only once you are done. Most scripts get high execution times due to disk I/O, so call "read" and "write" on the Excel file as little as possible; you can achieve that by bulk-editing the DataFrame and then saving. – SinkingTitanic Apr 30 '21 at 15:57
  • @SinkingTitanic this is what I would usually do. But since this is for a log file, I need to write to disk every time the function is called. Maybe I could load the excel file once and then keep it in memory and write-through to disk. Not sure this is feasible with pandas though. – Skum Apr 30 '21 at 16:01
  • 1
    Just load the excel into Pandas DataFrame, keep adding to the DataFrame and after some fixed time interval (say after every 30 minutes?), write that DataFrame to excel file. You will need to use a thread for the function that calls write to excel every 30 mins. Like this you will only use I/O every 30 mins rather than every time you edit the DataFrame – SinkingTitanic Apr 30 '21 at 16:06
  • I just opened an issue on the official pandas repo in order to ask if what I was attempting to do was possible. According to one of the devs, it would seem like it isn't, so it seems like your solution is the best possible alternative right now. Thanks for suggesting it – Skum May 03 '21 at 12:51
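The buffering approach suggested in the comments can be sketched roughly like this (the class name, filename, and flush interval are placeholders, not part of the original discussion):

```python
import time

import pandas as pd


class BufferedExcelLogger:
    """Keep log rows in memory and rewrite the Excel file only every
    `flush_interval` seconds, so disk I/O no longer dominates."""

    def __init__(self, filename, flush_interval=30.0):
        self.filename = filename
        self.flush_interval = flush_interval
        self.rows = []        # buffered log rows (dicts)
        self.styles = []      # True for rows that should be highlighted
        self.last_flush = time.monotonic()

    def log(self, row, important=False):
        # appending to in-memory lists is cheap; no disk access here
        self.rows.append(row)
        self.styles.append(important)
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        # rewrite the whole file once per interval, with row coloring intact
        df = pd.DataFrame(self.rows)
        highlight = self.styles
        styled = df.style.apply(
            lambda r: ['background-color: red' if highlight[r.name] else ''] * len(r),
            axis=1)
        styled.to_excel(self.filename, engine='openpyxl', index=False)
        self.last_flush = time.monotonic()
```

SinkingTitanic's comment suggests a background thread to flush on a timer; here the flush check is simply piggybacked on each `log()` call. The trade-off is that up to `flush_interval` seconds of log entries can be lost on a crash.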

0 Answers