Best search algorithm to find 'similar' strings in excel spreadsheet

Question

I am trying to figure out the most efficient way of finding similar values of a specific cell in a specified column(not all columns) in an excel .xlsx document. The code I have currently assumes all of the strings are unsorted. However the file I am using and the files I will be using all have strings sorted from A-Z. So instead of doing a linear search I wonder what other search algorithm I could use as well as being able to fix my coding eg(binary search etc).

So far I have created a function: find(). Before the function runs the program takes in a value from the user's input that then gets set as the sheet name. I print out all available sheet names in the excel doc just to help the user. I created an empty array results[] to store well....the results. I created a for loop that iterates through only column A because I only want to iterate through a custom column. I created a variable called start that is the first coordinate in column A eg(A1 or A400) this will change depending on the iteration the loop is on. I created a variable called next that will get compared with the start. Next is technically just start + 1, however since I cant add +1 to a string I concatenate and type cast everything so that the iteration becomes a range from A1-100 or however many cells are in column A. My function getVal() gets called with two parameters, the coordinate of the cell and the worksheet we are working from. The value that is returned from getVal() is also passed inside my function Similar() which is just a function that calls SequenceMatcher() from difflib. Similar just returns the percentage of how similar two strings are. Eg. similar(hello, helloo) returns int 90 or something like that. Once the similar function is called if the strings are above 40 percent similar appends the coordinates into the results[] array.


def setSheet(ws):
    sheet = wb[ws]
    return sheet

def getVal(coordinate, worksheet):
    value = worksheet[coordinate].value
    return value

def similar(first, second):
    percent = SequenceMatcher(None, first, second).ratio() * 100
    return percent

def find():
    column = "A"
    print("\n")
    print("These are all available sheets: ", wb.sheetnames)
    print("\n")
    name = input("What sheet are we working out of> ")

    results = []

    ws = setSheet(name)
    for i in range(1, ws.max_row):
        temp = str(column + str(i))
        x = ws[temp]
        start = ws[x].coordinate
        y = str(column + str(i + 1))
        next = ws[y].coordinate
        if(similar(getVal(start,ws), getVal(next,ws)) > 40):
            results.append(getVal(start))
    return results

This is some nasty looking code so I do apologize in advance. The expected results should just be a list of strings that are "similar".

Right now you are only comparing abutted pairs. Is that your intention? You know you can have similarities further apart, right? (And they do not even have to start with the same letter, and can still have > 90 % similaritiy.) — JohanL, Mar 27 '19 at 05:47
@JohanL I know 40% will probably catch too much data with the current similar function as was my intention my focus lies more on the efficiency of the search not so much on accuracy just yet, because its just checking if these are potentially similar. I know I'm comparing abutted pairs any suggestion on improving the comparison or search functionality. — Gary Frederick, Mar 27 '19 at 06:03
@GaryFrederick: Read about [`Worksheet.iter_xxx`](https://openpyxl.readthedocs.io/en/stable/tutorial.html#accessing-many-cells) — stovfl, Mar 27 '19 at 08:19
Possible duplicate of [Openpyxl optimizing cells search speed](https://stackoverflow.com/questions/34054272/openpyxl-optimizing-cells-search-speed) — Charlie Clark, Mar 27 '19 at 08:42
@CharlieClark not really what I was looking for. Again that example is iterating over the entire excel spreadsheet. In my case 1 column from the excel spreadsheet is equivalent to a single list of values. I just happen to be using an excel document to be pulling the data. — Gary Frederick, Mar 27 '19 at 16:03
I do not know Openpyxl, but if all you need is to find a value in a list, is it not simple and efficient enough to just copy the column to a python list and filter it in linear time? Unless the time to compare two values is large, or if you have billions of entries, I do not see why you need binary search or other specialties. — Inon Peled, Mar 27 '19 at 16:41
The principle is the same whether you do a whole worksheet or just a column: create a dictionary (the clue is in the name of the data structure) for your lookups. This **is** the most efficient approach. — Charlie Clark, Mar 27 '19 at 16:59

Best search algorithm to find 'similar' strings in excel spreadsheet

0 Answers0