Python Linear Search Better Efficiency

Question

I've got a question regarding Linear Searching in Python. Say I've got the base code of

for l in lines:
  for f in search_data:
     if my_search_function(l[1],[f[0],f[2]]):
        print "Found it!"
        break

in which we want to determine where in search_data exists the value stored in l[1]. Say my_search_function() looks like this:

def my_search_function(search_key, search_values):
   for s in search_values:
      if search_key in s:
         return True
    return False

Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.

for line in lines:
    negative_index = -1
    positive_index = 0
    middle_element = len(search_data) /2 if len(search_data) %2 == 0 else (len(search_data)-1) /2
    found = False

    while positive_index < middle_element:
        # print str(positive_index)+","+str(negative_index)
        if my_search_function(line[1], [search_data[positive_index][0],search_data[negative_index][0]]):
            print "Found it!"
            break
        positive_index = positive_index +1
        negative_index = negative_index -1

However, I'm not seeing any speed increases from this. Does anyone have a better approach? I'm looking to cut the processing speed in half as I'm working with large amounts of CSV and the processing time for one file is > 00:15 which is unacceptable as I'm processing batches of 30+ files. Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!

Are you looking for prefix matches (match at the start of string), like your example, or substring matches? — Janne Karila, Oct 20 '15 at 14:59

Janne Karila · Answer 1 · 2015-10-21T16:58:43.173

Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.

Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with help of the bisect module.

Or, instead of a large number of searches, sort also a combined list of all the search keys and traverse both lists in parallel, looking for matches. (Proceeding in a similar manner to the merge step in merge sort, without actually outputting a merged list)

Code to illustrate the second approach:

lines = ['AS12', 'AS123', 'AS123J', 'AS123JK','AS124']
search_keys = ['AS123', 'AS125']

try:
    iter_keys = iter(sorted(search_keys))
    key = next(iter_keys)
    for line in sorted(lines):
        if line.startswith(key):
            print('Line {} matches {}'.format(line, key))
        else:
            while key < line[:len(key)]:
                key = next(iter_keys)
except StopIteration: # all keys processed
    pass

score 0 · Answer 2 · answered Oct 20 '15 at 13:19

0

Depends on problem detail.

For instance if you search for complete words, you could create a hashtable on searchable elements, and the final search would be a simple lookup.

Filling the hashtable is pseudo-linear.

answered Oct 20 '15 at 13:19

Pieter21

1,765
1
10
22

Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. – AndrewSmiley Oct 20 '15 at 13:27
@AndrewSmiley: A trie is the standard data structure here. – Charles Oct 20 '15 at 13:30
@Charles You're right. I didn't catch that before. However, I know the Complexity of a Trie is O(L*W), but I would like better in my code. – AndrewSmiley Oct 20 '15 at 13:35

score 0 · Accepted Answer · edited May 23 '17 at 11:52

Ultimately, I was broke down and implemented Binary Search on my multidimensional lists by sorting using the sorted() function with a lambda as a key argument.Here is the first pass code that I whipped up. It's not 100% efficient, but it's a vast improvement from where we were

def binary_search(master_row, source_data,master_search_index, source_search_index):
    lower_bound = 0
    upper_bound = len(source_data) - 1
    found = False
    while lower_bound <= upper_bound and not found:
        middle_pos = (lower_bound + upper_bound) // 2

        if source_data[middle_pos][source_search_index] < master_row[master_search_index]:

            if search([source_data[middle_pos][source_search_index]],[master_row[master_search_index]]):
                return {"result": True, "index": middle_pos}
                break
            lower_bound = middle_pos + 1

        elif source_data[middle_pos][source_search_index] > master_row[master_search_index] :

            if search([master_row[master_search_index]],[source_data[middle_pos][source_search_index]]):
                return {"result": True, "index": middle_pos}
                break
            upper_bound = middle_pos - 1
        else:

            if len(source_data[middle_pos][source_search_index]) > 5:

                return {"result": True, "index": middle_pos}
            else:
                break

and then where we actually make the Binary Search call

#where master_copy is the first multidimensional list, data_copy is the second
#the search columns are the columns we want to search against
for line in master_copy:
    for m in master_search_columns:
        found = False
        for d in data_search_columns:
            data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
            results = binary_search(line, data_copy,m, d)
            found = results["result"]
            if found:
                line = update_row(line, data_copy[results["index"]], column_mapping)
                found_count = found_count +1
                break
        if found:
            break

Here's the info for sorting a multidimensional list Python Sort Multidimensional Array Based on 2nd Element of Subarray

Python Linear Search Better Efficiency

3 Answers3