
I have a list containing a variety of document names which may look like this:

C:\folder\somepath\1234_456_2.pdf
C:\folder\somepath\whatever\5932194_123.pdf
C:\folder\somepath\2022_10_10_5932194_123.pdf
C:\folder\somepath\January\123_5932192.pdf
C:\folder\somepath\whatever\123_59321911_1234.pdf
C:\folder\somepath\whatever\123_5932197.pdf
...

The list isn't excessively large, containing ~3000 entries.

In my Excel file I have a column containing about 1 million values in total. If the value in a cell of that column occurs anywhere within one of those document names, the entire row should have its background coloured green.

I tried solving this with openpyxl, which worked to a certain extent.

from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter

for col in sheet.columns:
    column = get_column_letter(col[0].column)
    if sheet[column + "1"].value == "Column I am looking for":
        # max_row + 1 is needed because range() excludes its end value
        for j in range(2, sheet.max_row + 1):
            # substring check of the cell value against the stringified list of paths
            if str(sheet[column + str(j)].value) in str(the_list):
                sheet[column + str(j)].fill = PatternFill("solid", fgColor="92D050")

It works decently well on smaller files, but on larger ones it takes unbearably long. I have no idea how to implement something similar in pandas, and I do not know how to make this run faster with openpyxl. How can I fix this?

Edit: I forgot to add what a column in the Excel file may look like.

Some Col.    Other Col.    Relevant Col.
asdf         1111          5932194
fdsa         3214          5342133

Since the relevant column contains ints, I had to convert the value to a string for the check: `if str(sheet[column + str(j)].value) in str(the_list):`

I found that once the Excel file grows a bit too large, each individual cell check becomes rather slow.

Squary94
  • Four nested for loops? There's your problem. Read the openpyxl documentation more carefully and you should be able to improve things: APIs exist for a reason. – Charlie Clark Aug 11 '22 at 16:08
  • @CharlieClark I have already put some work into it and came to no conclusion on how to do it any other way, and reading the documentation for openpyxl yielded nothing. – Squary94 Aug 15 '22 at 07:05
  • Will the column you are looking for always be the same? Across workbooks or across each item in the list you are checking? – Alan Aug 15 '22 at 13:07
  • @Alan It is guaranteed to always have the same name, which is why I am iterating through the columns to find it. However, it can be assumed to always be in the same column. – Squary94 Aug 15 '22 at 13:22
  • @Vitalizzare I added an example as an edit. As for why I convert the list to `str()`: if I didn't do that, it wouldn't find any of the matches it is supposed to. – Squary94 Aug 16 '22 at 08:01
  • You didn't provide a full example, so just some tips: learning enough pandas to import the Excel file and to locate the column makes sense. Then you can create a list of row numbers that need colouring and later colour these in one go. Also: you didn't mention the file size of the Excel file with the million rows. Excel is usually very slow on large files, so storing the data in Excel may be the wrong approach in the first place. – 576i Aug 16 '22 at 08:15
  • @Vitalizzare it is not necessarily the first, as there could be other numbers before it with only a `_` separating them. – Squary94 Aug 16 '22 at 08:53
  • @Vitalizzare I clarified it a bit more as an edit in the main post. There is a main directory in which the files are. They may be in a subfolder such as January, but it is still part of the main directory. The list already gets filtered by a significant amount after initial creation, so there is little optimization that can be done there. – Squary94 Aug 16 '22 at 09:07

3 Answers


Explanation

The more often you access the Excel file, the longer the process takes. In your example, you access the file far too many times; this is why it is so slow.

Firstly, do not iterate over all the columns until you reach the one you need; start directly at the proper column.

Secondly, instead of accessing the file for each cell (1 million times in your example), you can retrieve the whole column as a pandas dataframe in a single access.

Finally, pandas is not designed for iterating over large dataframes; avoid iterating when a faster solution is possible. Instead of iterating over the dataframe to find matching values, you can filter to retrieve them directly. For more information, see How to iterate over rows in a DataFrame in Pandas.
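
For instance, a minimal sketch of filtering with a boolean mask instead of a loop (the column name and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({'Path': ['Matching 1', 'Other', 'Matching 2']})
# one vectorised filter call instead of a Python-level loop over the rows
matches = df[df['Path'].isin(['Matching 1', 'Matching 2'])]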

Unfortunately, it is not possible to set all the cells to green in a single call, as there can be white cells in between, so you will have to iterate to change the background colours. For more information, see How to get/set different colours of the same range from an Excel file using xlwings in python?

Example 1

I will start with a simple example that will exactly match the values of the list. If we want to inspect the column C of this Excel file:

[screenshot: Original]

And compare it with this list of matching values:

list_of_values = ["Matching 1", "Matching 2", "Matching 3"]

The following code will set the matching values to green.

import pandas as pd
import xlwings as xw

# Define the RGB code of the color green
green = (226, 239, 218)

# Define the matching values
list_of_values = ["Matching 1", "Matching 2", "Matching 3"]

# Connect to the example Excel file
wb = xw.Book('Test.xlsx')
sht = wb.sheets['Sheet1']
column_to_inspect = 'C'

# Retrieve the values of the column to inspect
df = sht.range('{column}1:{column}7'.format(column=column_to_inspect)).options(pd.DataFrame, index=False, header=True).value

# Set in green the matching values
for i in df[df['Path'].isin(list_of_values)].index:
    # +2 is needed: the header row is skipped and the dataframe index starts at 0, while Excel rows start at 1. Increase this value if your first data row is not row 2.
    sht["{column}{row}".format(column=column_to_inspect, row=i+2)].color = green

[screenshot: Results]

Example 2

Finally, this example should be really similar to what you need as it is based on a list of substrings.

[screenshot: Final]

import pandas as pd
import xlwings as xw

# Define the RGB code of the color green
green = (226, 239, 218)

# Define the matching values
list_of_substrings = ["USA", "UK", "Japan"]

# Connect to the example Excel file
wb = xw.Book('Test.xlsx')
sht = wb.sheets['Sheet1']
column_to_inspect = 'C'

# Retrieve the values of the column to inspect
df = sht.range('{column}1:{column}7'.format(column=column_to_inspect)).options(pd.DataFrame, index=False, header=True).value

# Set in green the cells that contain one of the substrings
# note: str.contains treats the joined pattern as a regex, so escape any
# special characters (e.g. with re.escape) if your substrings contain them
for i in df[df['Path'].str.contains('|'.join(list_of_substrings))].index:
    # +2 is needed: the header row is skipped and the dataframe index starts at 0, while Excel rows start at 1. Increase this value if your first data row is not row 2.
    sht["{column}{row}".format(column=column_to_inspect, row=i+2)].color = green

For more information, see How to test if a string contains one of the substrings in a list, in pandas?

Romain Capron

I second Romain's points: limit the number of times the file is read, and avoid iteration when you can use set operations.

In my opinion you can set the row colour in one operation, and you don't need to drop down into xlwings to do it.

I will set out some examples below to explain different approaches:

Option 1 - Iteration

import pandas as pd

# Set up the requirements for the row to be coloured
# this will make more sense later
def color(row):
    if row["check"] == "matched":
        return ['background-color: red'] * len(row)
    return [''] * len(row)

# Note these are raw strings to handle the Windows backslash path character
values_to_check = [r'C:\folder\somepath\1234_456_2.pdf', r'C:\folder\somepath\whatever\5932194_123.pdf']

df = pd.read_excel('data.xlsx', sheet_name='My Data')
# Add a blank column as a placeholder
df["check"] = ""

for i in range(len(df)):
    # test whether the (stringified) cell value appears in any of the paths
    if any(str(df.loc[i, "value"]) in x for x in values_to_check):
        df.loc[i, "check"] = "matched"
    else:
        df.loc[i, "check"] = "not matched"

# now we can apply the colour option

# associate a styler object with the dataframe
styler = df.style

# apply the colour function to select and change the rows
styler.apply(color, axis=1)

# use ExcelWriter rather than calling to_excel directly in order to get access to the append & replace options
with pd.ExcelWriter("data.xlsx", engine="openpyxl", mode="a", if_sheet_exists="replace") as writer:
    styler.to_excel(writer, 'My Data', index=False)

This gives an output with an additional column flagging whether or not it is matched.

[screenshot: example output]

Option 2 - Set operations (Pandas merge)

import pathlib

import pandas as pd

def color_joined(row):
    if row["_merge"] == "both":
        return ['background-color: red'] * len(row)
    return [''] * len(row)

def clean_inputs(input_item: str) -> int:
    # Using PureWindowsPath vs Path to handle the backslashes
    # stem returns the filename only, no path or extension
    # get rid of the underscores to apply int comparisons based on your comments
    return int(pathlib.PureWindowsPath(input_item).stem.replace('_',''))

values_to_check = [r'C:\folder\somepath\1234_456_2.pdf', r'C:\folder\somepath\whatever\5932194_123.pdf']

# Let's have only the filenames, and without the underscores, as int
# you may need to fiddle with this a bit to match your real-world data
cleaned_filenames = [ clean_inputs(x) for x in values_to_check ]

# No need to invent a blank check column here
df = pd.read_excel('data.xlsx', sheet_name='My Data')

# Instead, convert the value list into a dataframe too
lookup_list = pd.DataFrame(cleaned_filenames, columns=['value'])

# this uses a left join and leaves a flag 
joined_df = df.merge(lookup_list, on='value', how='left', indicator=True)
# the result is a df with all of the records, plus a column called "_merge"
# the values of this column will be either "left_only" for no match or "both" for a match

styler = joined_df.style
styler.apply(color_joined, axis=1)
# Drop the _merge column by writing out only the specified columns
# (default mode "w" is used here, since output.xlsx may not exist yet)
with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer:
    styler.to_excel(writer, 'Merged', index=False, columns=['title', 'description', 'value', 'extra_column'])

This gives the same output as above. In theory it should be more optimised than simply using multiple loops, but, as always, you should test performance on your specific data.

Notes:

If you wanted a list of only the matches, with option 2 this could be done with `s1 = pd.merge(df, lookup_list, how='inner', on=['value'])`.

In theory, you should be able to drop the columns before writing to Excel by using `styler.hide(subset=['_merge', 'check'], axis="columns")`; however, I couldn't get this to work in my test. See the styler.hide documentation for details.

You can save memory (and speed up processing) by specifying the data type of the columns (e.g. int), as the default is to use the generic object dtype.
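
As a minimal sketch of that last note (assuming, as in the examples above, a lookup column called "value" in data.xlsx):

import pandas as pd

# forcing an integer dtype up front avoids the generic object dtype
df = pd.read_excel('data.xlsx', sheet_name='My Data', dtype={'value': 'int64'})
print(df.dtypes)  # "value" is now int64 rather than object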

Alan

Solve with a mini data pipeline

  1. Convert the xlsx to csv [a fast library should exist, or Excel itself]
  2. Load the csv [fast]
  3. Find matches and save them as a list of cells to mark ["C5","C768576",...] [fast]
  4. Update the xlsx by updating the fill colour of the cells in that list [faster than iterating through the entire column] (see the sketch below)
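
A rough sketch of this pipeline, with made-up names (data.xlsx, sheet 'My Data', lookup column "value", column letter C) and an exact match standing in for the substring test in step 3; pandas is used for the csv conversion for brevity:

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

# 1. Convert the xlsx to csv once
pd.read_excel('data.xlsx', sheet_name='My Data').to_csv('data.csv', index=False)

# 2. Load the csv (much faster to parse than xlsx)
df = pd.read_csv('data.csv')

# 3. Find matches and build the list of cells to mark, e.g. ["C5", "C7", ...]
wanted = {'5932194', '5932197'}  # made-up values for the sketch
matching = df.index[df['value'].astype(str).isin(wanted)]
cells_to_mark = [f"C{i + 2}" for i in matching]  # +2: header row plus 1-based rows

# 4. Update the fill colour of only the marked cells
wb = load_workbook('data.xlsx')
ws = wb['My Data']
green = PatternFill('solid', fgColor='92D050')
for cell in cells_to_mark:
    ws[cell].fill = green
wb.save('data.xlsx')
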
SargeATM