0

I've got a question regarding Linear Searching in Python. Say I've got the base code of

for l in lines:
  for f in search_data:
     if my_search_function(l[1],[f[0],f[2]]):
        print "Found it!"
        break

in which we want to determine where in search_data exists the value stored in l[1]. Say my_search_function() looks like this:

def my_search_function(search_key, search_values):
   for s in search_values:
      if search_key in s:
         return True
    return False

Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.

for line in lines:
    negative_index = -1
    positive_index = 0
    middle_element = len(search_data) /2 if len(search_data) %2 == 0 else (len(search_data)-1) /2
    found = False

    while positive_index < middle_element:
        # print str(positive_index)+","+str(negative_index)
        if my_search_function(line[1], [search_data[positive_index][0],search_data[negative_index][0]]):
            print "Found it!"
            break
        positive_index = positive_index +1
        negative_index = negative_index -1

However, I'm not seeing any speed increases from this. Does anyone have a better approach? I'm looking to cut the processing speed in half as I'm working with large amounts of CSV and the processing time for one file is > 00:15 which is unacceptable as I'm processing batches of 30+ files. Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!

AndrewSmiley
  • 1,933
  • 20
  • 32

3 Answers3

1

Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.

Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with help of the bisect module.

Or, instead of a large number of searches, sort also a combined list of all the search keys and traverse both lists in parallel, looking for matches. (Proceeding in a similar manner to the merge step in merge sort, without actually outputting a merged list)

Code to illustrate the second approach:

lines = ['AS12', 'AS123', 'AS123J', 'AS123JK','AS124']
search_keys = ['AS123', 'AS125']

try:
    iter_keys = iter(sorted(search_keys))
    key = next(iter_keys)
    for line in sorted(lines):
        if line.startswith(key):
            print('Line {} matches {}'.format(line, key))
        else:
            while key < line[:len(key)]:
                key = next(iter_keys)
except StopIteration: # all keys processed
    pass
Janne Karila
  • 24,266
  • 6
  • 53
  • 94
0

Depends on problem detail.

For instance if you search for complete words, you could create a hashtable on searchable elements, and the final search would be a simple lookup.

Filling the hashtable is pseudo-linear.

Pieter21
  • 1,765
  • 1
  • 10
  • 22
  • Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. – AndrewSmiley Oct 20 '15 at 13:27
  • @AndrewSmiley: A trie is the standard data structure here. – Charles Oct 20 '15 at 13:30
  • @Charles You're right. I didn't catch that before. However, I know the Complexity of a Trie is O(L*W), but I would like better in my code. – AndrewSmiley Oct 20 '15 at 13:35
0

Ultimately, I was broke down and implemented Binary Search on my multidimensional lists by sorting using the sorted() function with a lambda as a key argument.Here is the first pass code that I whipped up. It's not 100% efficient, but it's a vast improvement from where we were

def binary_search(master_row, source_data,master_search_index, source_search_index):
    lower_bound = 0
    upper_bound = len(source_data) - 1
    found = False
    while lower_bound <= upper_bound and not found:
        middle_pos = (lower_bound + upper_bound) // 2

        if source_data[middle_pos][source_search_index] < master_row[master_search_index]:

            if search([source_data[middle_pos][source_search_index]],[master_row[master_search_index]]):
                return {"result": True, "index": middle_pos}
                break
            lower_bound = middle_pos + 1

        elif source_data[middle_pos][source_search_index] > master_row[master_search_index] :

            if search([master_row[master_search_index]],[source_data[middle_pos][source_search_index]]):
                return {"result": True, "index": middle_pos}
                break
            upper_bound = middle_pos - 1
        else:

            if len(source_data[middle_pos][source_search_index]) > 5:

                return {"result": True, "index": middle_pos}
            else:
                break

and then where we actually make the Binary Search call

#where master_copy is the first multidimensional list, data_copy is the second
#the search columns are the columns we want to search against
for line in master_copy:
    for m in master_search_columns:
        found = False
        for d in data_search_columns:
            data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
            results = binary_search(line, data_copy,m, d)
            found = results["result"]
            if found:
                line = update_row(line, data_copy[results["index"]], column_mapping)
                found_count = found_count +1
                break
        if found:
            break

Here's the info for sorting a multidimensional list Python Sort Multidimensional Array Based on 2nd Element of Subarray

Community
  • 1
  • 1
AndrewSmiley
  • 1,933
  • 20
  • 32