Python - find index of unique substring contained in list of strings without going through all the items

Question

I have a question that may sound like a duplicate, but I haven't been able to find a good answer. Every day I get a list with a few thousand strings in it. I also know that this list will always contain exactly one item containing the word "other". For example, one day I may have:

a = ['mark','george', .... , " ...other ...", 'matt','lisa', ... ]

another day I may get:

a = ['karen','chris','lucas', ............................., '...other']

As you can see, the position of the item containing the substring "other" is random. My goal is to get the index of that item as fast as possible. I found other answers here, most of which suggest list comprehensions or loops, for example: "Finding a substring within a list in Python" and "Check if a Python list item contains a string inside another string". They don't work for me because they are too slow. Other solutions suggest using any() to check whether "other" is contained in the list, but I need the index, not a boolean value. I believe regex might be a good solution, though I'm having a hard time figuring out how. So far I have only managed to do the following:

# any_other_value_available tells me very quickly whether 'other' is contained in the list.
any_other_value_available = 'other' in str(list_unique_keys_in_dict).lower()

From here, I don't quite know what to do. Any suggestions? Thank you.

How much "too slow" are they, and how much faster do you need it? — Kelly Bundy, Mar 10 '20 at 13:13
You can't search without searching. Perhaps you can store this as you load the data structure (search once instead of many times). Also, expect some false positives. — Kenny Ostrom, Mar 10 '20 at 13:13
Hi @Angelo--from my tests (see answer below), regex produced the slowest search method. — DarrylG, Mar 10 '20 at 17:39

DarrylG · Answer 1 · 2020-03-10T19:53:25.940

Methods Explored

1. Generator Method

next(i for i,v in enumerate(test_strings) if 'other' in v)

2. List Comprehension Method

[i for i,v in enumerate(test_strings) if 'other' in v]

3. Using Index with Generator (suggested by @HeapOverflow)

test_strings.index(next(v for v in test_strings if 'other' in v))

4. Regular Expression with Generator

re_pattern = re.compile('.*other.*')
next(test_strings.index(x) for x in test_strings if re_pattern.search(x))
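
All four snippets assume that a match exists. If none does, next() on an exhausted generator raises StopIteration, and the list comprehension simply returns an empty list. A minimal sketch of a safer wrapper follows; the name find_index and the sample data are illustrative only and are not part of the benchmark below.

```python
def find_index(strings, needle="other"):
    """Return the index of the first string containing needle, or None."""
    # The second argument to next() is returned when the generator is exhausted,
    # so no StopIteration escapes when nothing matches.
    return next((i for i, s in enumerate(strings) if needle in s), None)

names = ["mark", "george", "...other...", "matt"]
print(find_index(names))        # 2
print(find_index(["a", "b"]))   # None
```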

Conclusion

The index method (method 3, suggested by @HeapOverflow in the comments) was the fastest for all but the shortest lists.
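
The benchmark covers per-item scans only. One further variant, not measured here, builds on the question's idea of searching a single string: join the items with a separator, find the substring once, then count the separators before the hit to recover the index. This is only a sketch; the name find_index_joined is made up, and it assumes the separator never occurs inside any item or inside the search term. Whether it beats the methods above would need to be timed on real data.

```python
def find_index_joined(strings, needle="other", sep="\n"):
    """Locate needle with one str.find over the joined text, then map back to an index."""
    # Assumption: sep does not appear in any item or in needle.
    joined = sep.join(strings)
    pos = joined.find(needle)
    if pos == -1:
        return None
    # Every separator before pos marks the end of one earlier item,
    # so the count equals the index of the item containing the hit.
    return joined.count(sep, 0, pos)

names = ["mark", "george", "...other...", "matt"]
print(find_index_joined(names))  # 2
```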

Test Code

Using perfplot, which uses timeit:

import random 
import string
import re
import perfplot

def random_string(N):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

def create_strings(length):
    M = length // 2
    random_strings = [random_string(5) for _ in range(length)]

    front = ['...other...'] + random_strings
    middle = random_strings[:M] + ['...other...'] + random_strings[M:]
    end_ = random_strings + ['...other...']

    return front, middle, end_

def search_list_comprehension(test_strings):
    return [i for i,v in enumerate(test_strings) if 'other' in v][0]

def search_generator(test_strings):
    return next(i for i,v in enumerate(test_strings) if 'other' in v)

def search_index(test_strings):
    return test_strings.index(next(v for v in test_strings if 'other' in v))

def search_regex(test_strings):
    re_pattern = re.compile('.*other.*')
    return next(test_strings.index(x) for x in test_strings if re_pattern.search(x))

# Each benchmark is run with '...other...' placed at the front, middle and end of a random list of strings.

out = perfplot.bench(
    setup=lambda n: create_strings(n),  # create front, middle, end lists of length n
    kernels=[
        lambda a: [search_list_comprehension(x) for x in a],
        lambda a: [search_generator(x) for x in a],
        lambda a: [search_index(x) for x in a],
        lambda a: [search_regex(x) for x in a],
    ],
    labels=["list_comp", "generator", "index", "regex"],
    n_range=[2 ** k for k in range(15)],
    xlabel="list length",
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)

out.show()
print(out)
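
If perfplot is not installed, a rough comparison can be made with the standard library's timeit alone. The sketch below is not the benchmark that produced the results in this answer; the test data (5000 made-up names with the target at the end) and the repeat count are arbitrary assumptions.

```python
import timeit

def search_generator(strings):
    return next(i for i, v in enumerate(strings) if "other" in v)

def search_index(strings):
    return strings.index(next(v for v in strings if "other" in v))

# Hypothetical data: the target is the last item, the worst case for a linear scan.
data = [f"name{i}" for i in range(5000)] + ["...other..."]

runs = 200
for func in (search_generator, search_index):
    seconds = timeit.timeit(lambda: func(data), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.1f} us per call")
```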

Results

length list   regex    list_comp  generator    index
     1.0     10199.0     3699.0     4199.0     3899.0
     2.0     11399.0     3899.0     4300.0     4199.0
     4.0     13099.0     4300.0     4599.0     4300.0
     8.0     16300.0     5299.0     5099.0     4800.0
    16.0     22399.0     7199.0     5999.0     5699.0
    32.0     34900.0    10799.0     7799.0     7499.0
    64.0     59300.0    18599.0    11799.0    11200.0
   128.0    108599.0    33899.0    19299.0    18500.0
   256.0    205899.0    64699.0    34699.0    33099.0
   512.0    403000.0   138199.0    69099.0    62499.0
  1024.0    798900.0   285600.0   142599.0   120900.0
  2048.0   1599999.0   582999.0   288699.0   239299.0
  4096.0   3191899.0  1179200.0   583599.0   478899.0
  8192.0   6332699.0  2356400.0  1176399.0   953500.0
 16384.0  12779600.0  4731100.0  2339099.0  1897100.0

@HeapOverflow--your method is the fastest as shown in my updated post. Thanks. — DarrylG, Mar 10 '20 at 14:30

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12


If you are looking for a substring, regular expressions are a good way to find it.

In your case you are looking for the item that contains 'other'. As you have already mentioned, the elements in the list are in no particular order. The search is therefore linear, and it would remain linear even if the list were sorted.

A regular expression that might describe your search is query='.*other.*'. According to the documentation:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

With .* before and after other, the pattern allows 0 or more repetitions of any character on either side.

For example:

import re

list_of_variables = ['rossum', 'python', '..other..', 'random']
query = '.*other.*'
indices = [list_of_variables.index(x) for x in list_of_variables if re.search(query, x)]

This returns a list of the indices of the items that match your query. In this example indices will be [2], since '..other..' is the third element in the list.
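
One caveat about the comprehension above: list.index(x) returns the position of the first equal value and rescans the list for every match. With exactly one matching item, as in the question, this gives the right answer. The sketch below uses a made-up list with a duplicate to show the difference and how enumerate() avoids it. It also uses the plain pattern 'other', since re.search already scans the whole string.

```python
import re

# Hypothetical data with a duplicated match to expose the difference.
list_of_variables = ['rossum', '..other..', 'python', '..other..']
pattern = re.compile('other')  # search() scans the string, so no '.*' padding is needed

# list.index(x) returns the first position of an equal value, so duplicates collapse.
via_index = [list_of_variables.index(x) for x in list_of_variables if pattern.search(x)]
# enumerate() yields each item's own position and avoids the extra scan per match.
via_enumerate = [i for i, x in enumerate(list_of_variables) if pattern.search(x)]

print(via_index)      # [1, 1]
print(via_enumerate)  # [1, 3]
```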

edited Jun 20 '20 at 09:12 by Community · answered Mar 10 '20 at 13:26 by Alexander L

How much faster is this? – Kelly Bundy Mar 10 '20 at 13:40
Measuring the process time for this particular example yields approximately 0.0002113 seconds. In contrast, the questioner's code is 10 times faster but does not return the index of the searched substring, so it's not really comparable. – Alexander L Mar 10 '20 at 13:58
How about the other answer's solution then? – Kelly Bundy Mar 10 '20 at 14:01
Without further testing, the other answer should be faster, according to https://stackoverflow.com/questions/4901523/whats-a-faster-operation-re-match-search-or-str-find – Alexander L Mar 10 '20 at 14:16
However, instead of enumeration you can use a list comprehension with indexing, as in my answer. Just replace `re.search(query, x)` with `'other' in x`. – Alexander L Mar 10 '20 at 14:22
Thank you Alexander, this runs way faster on my end. Thank you – Angelo Mar 10 '20 at 15:09
