Find indicies of string in list of strings for an unsorted list

Question

Problem Statement

If you are given a string lets call it target, can you give me back ALL indicies in which target is found in some input list of strings (this is important, it is a list of strings not numbers), lets call it input_list. Example inputs:

target = '1234'
input_list = [str(x) for x in range(30000)] + [str(x) for x in range(30000)]

You can not assume that the input_list is sorted if you want to sort the list you will need to add that to your own version of benchmarkFind(). The simple solution is to just do the following but this can be very inefficient:

def benchmarkFind(target,input_list):
    out = []
    for i in range(len(input_list)):
        if input_list[i] == target:
            out.append(i)
    return out

The answer here would be:

idx = [1234, 31234]

Benchmark Results

>>> %timeit benchmarkFind(target,input_list)
100 loops, best of 3: 3.07 ms per loop

User Comment Results

From @trincot and @Abhinav Sood - Slightly better but not great.

def enumerateFind(target,input_list):
    return [i for i, e in enumerate(input_list) if e == target]

>>> %timeit enumerateFind(target,input_list)
100 loops, best of 3: 2.96 ms per loop

From @B. M - This looks to be the best answer thus far!

def primitiveFind(target ,input_list):
    try :
        l=[]
        u=-1
        while True:
            u = input_list.index(target,u+1)
            l.append(u)
    except ValueError:
        return l

>>> %timeit primitiveFind(target,input_list)
1000 loops, best of 3: 577 µs per loop

If you are going to return the whole list then there is no advantage in creating a generator. Did you try `[i for i, val in enumerate(input_list) if val == target]`? — trincot, Nov 27 '18 at 19:33
@trincot yes. Although it is true that this is usually faster, and it is, not by much. I have updated the problem statement to clear things up and provided an example with your suggestion. — walsha2, Nov 28 '18 at 00:57
Possible duplicate of [Python Fastest way to find Indexes of item in list](https://stackoverflow.com/questions/35640780/python-fastest-way-to-find-indexes-of-item-in-list) — trincot, Nov 28 '18 at 07:30
Timing results really depend on how sparse the list is with respect to the searched value. See also [this answer and comments](https://stackoverflow.com/a/18669080/5459839). Any way this has been asked before. — trincot, Nov 28 '18 at 07:46

score 1 · Answer 1 · answered Nov 27 '18 at 19:54

1

Enumeration is way faster:

python -m timeit -s "\
    target = '1234';\
    input_list = [str(x) for x in range(30000)] + [str(x) for x in range(30000)];\
    idx = [i for i, e in enumerate(input_list) if e == target]"

100000000 loops, best of 3: 0.00582 usec per loop

answered Nov 27 '18 at 19:54

Abhinav Sood

799
6
23

@Abbhinav I am not sure where you are seeing that kind of performance, but that is not right> please see my updated "User Comment Results" section in my post. Enumeration is generally faster yes, but not by much. – walsha2 Nov 28 '18 at 00:59
@walsha2 I am running this on Python 3.6.5 from Anaconda (with GCC 7.2.0) on a RHEL Linux machine with 24G memory. I ran this again, and still see roughly the same time: `100000000 loops, best of 3: 0.00585 usec per loop` – Abhinav Sood Nov 28 '18 at 18:48
Still can not recreate this with similar machine. That time can not be right. Plus that time is inconsistent with everything that was reported above and in other questions. I think the correct answer is still what B.M. posted. – walsha2 Nov 29 '18 at 01:05
That's so weird, but great looking at the comparisons between different techniques. Also, I would recommend using the same number of iterations for all possible solutions that you're evaluating. – Abhinav Sood Nov 29 '18 at 15:00

B. M. · Accepted Answer · 2018-11-28T07:13:57.463

Python loops are known to be slow, but primitive on list are fast. list.index is fast.

def find2(target ,input_list):
    try :
        l=[]
        u=-1
        while True:
        u= input_list.index(target,start=u+1)
        l.append(u)
    except ValueError:
        return l

runs:

In [32]: find2(target,input_list)
Out[32]: [1234, 31234]

In [33]: %timeit find2(target,input_list)
2.8 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [34]: %timeit benchmarkFind(target,input_list)
12 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [35]: %timeit [i for i, e in enumerate(input_list) if e == target]
14.2 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

4x faster on my computer this morning.

EDIT

For further tuning, data alignement is important, numpy arrays is a good way to achieve that. Unfortunately, the conversion from list is costly, so this make sense if the data can be supplied in array form :

input_arr=np.array(input_list)

(input_arr==target).nonzero()
(array([ 1234, 31234], dtype=int64),)

%timeit input_arr=np.array(input_list)
10.6 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit (input_arr==target).nonzero()
1.56 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Good comment, yea can never stray too far from the primitives! I provided a working example of your comment above and it does look to be the fasted approach so far! Please see my updated "User Comment Results" section in my post. — walsha2, Nov 28 '18 at 01:01

Find indicies of string in list of strings for an unsorted list

Problem Statement

Benchmark Results

User Comment Results

2 Answers2