I need to log important information in a human readable format for an app.
The three biggest requirements are as follows:
- The info should be logged in a single CSV/excel file.
- A human should be able to read the CSV/excel file efficiently, which implies that important rows should be colored.
- It should be fast and memory efficient
I have found many ways of satisfying requirement 1, 2 and 3. But never all at the same time.
For point 1, I can use Append existing excel sheet with new dataframe using python pandas or How to add pandas data to an existing csv file?
For point 2, I can use Color formatting excel file row in python
If I want to accomplish point 1 and 2 at the same time, I can use a slightly modified version of Append existing excel sheet with new dataframe using python pandas (by using df.style.apply
right before the to_excel()
call). But since the excel file is read at each iteration, the result is slow.
Similarly, How to add pandas data to an existing csv file? satisfies point 1 and 3, but not point 2, as it would seem like colors do not work in the same way for to_csv()
(at least not when saving to a file). to_excel()
, on the other hand, colors the rows correctly but doesn't take the mode='a'
argument which gives to to_csv()
its efficiency.
Here is a minimum reproducible example corresponding to point 1 and 2 at the same time. As you will see if you run the code, the time taken per iteration quickly increases.
from os.path import isfile
import pandas as pd
from openpyxl import load_workbook
from time import time
def highlight_idx(x, idx):
if x.name == idx:
return ['background-color: red'] * len(x)
return [''] * len(x)
def append_df_to_excel(filename, df, sheet_name='Sheet1', selected_idx=None, startrow=None,
truncate_sheet=False,
**to_excel_kwargs):
# Excel file doesn't exist - saving and exiting
if not isfile(filename):
df.style.apply(lambda x: highlight_idx(x, selected_idx), axis=1).to_excel(
filename,
sheet_name=sheet_name,
startrow=startrow if startrow is not None else 0,
**to_excel_kwargs)
return
# ignore [engine] parameter if it was passed
if 'engine' in to_excel_kwargs:
to_excel_kwargs.pop('engine')
writer = pd.ExcelWriter(filename, engine='openpyxl', mode='a')
# try to open an existing workbook
writer.book = load_workbook(filename)
# get the last row in the existing Excel sheet
# if it was not specified explicitly
if startrow is None and sheet_name in writer.book.sheetnames:
startrow = writer.book[sheet_name].max_row
# truncate sheet
if truncate_sheet and sheet_name in writer.book.sheetnames:
# index of [sheet_name] sheet
idx = writer.book.sheetnames.index(sheet_name)
# remove [sheet_name]
writer.book.remove(writer.book.worksheets[idx])
# create an empty sheet [sheet_name] using old index
writer.book.create_sheet(sheet_name, idx)
# copy existing sheets
writer.sheets = {ws.title: ws for ws in writer.book.worksheets}
if startrow is None:
startrow = 0
# write out the new sheet
df.style.apply(lambda x: highlight_idx(x, selected_idx), axis=1).to_excel(writer, sheet_name,
startrow=startrow,
**to_excel_kwargs)
# save the workbook
writer.save()
for i in range(100000):
df = pd.DataFrame({'A': ['2', '3', '4'],
'B': ['5', '6', '7']})
start = time()
append_df_to_excel("out.xlsx", df, selected_idx=1)
print("time taken for iteration:", time() - start)
I have tried many other ways of doing things. But as I said earlier, I haven't been able to satisfy all 3 points so far.