Find list index of list items within other list items

Question

I have a list of long strings and I'd like to get the indexes of the list elements that match a substring of strings in another list. Checking if a list item contains a a single string inside a list is easy to do with list comprehensions, like this question:

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
thing_to_find = "abc"
matching = [i for i, x in enumerate(my_list) if thing_to_find in x]

However, I'd like to check not only if "abc" is in x, but if any strings in another list are in the list, like so:

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
things_to_find = ['abc', 'def']

This obviously doesn't work (but it would be really cool if it did):

matching = [i for i, x in enumerate(my_list) if things_to_find in x]

I can find the list indexes if I run commands individually, but it's tedious and horrible:

print([i for i, x in enumerate(my_list) if 'abc' in x])
# [0, 3]
print([i for i, x in enumerate(my_list) if 'def' in x])
# [1]

What's the best way to find the indexes of all instances where elements from one list are found in another list?

score 5 · Accepted Answer · answered Aug 13 '13 at 14:33

5

You are looking for the any() function here:

matching = [i for i, x in enumerate(my_list) if any(thing in x for thing in things_to_find)]

Demo:

>>> my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
>>> things_to_find = ['abc', 'def']
>>> [i for i, x in enumerate(my_list) if any(thing in x for thing in things_to_find)]
[0, 1, 3]

answered Aug 13 '13 at 14:33

Martijn Pieters

1,048,767
296
4,058
3,343

This has quadratic performance - good for short `things_to_find`, poor for longer such lists. – Marcin Aug 13 '13 at 14:41
@Marcin: to avoid that, you'd need to use a more efficient data structure that support searching for prefixes, or requires the input string to be split. – Martijn Pieters Aug 13 '13 at 14:42
Like, say a compiled regex? Still, +1 – Marcin Aug 13 '13 at 14:43
The actual `things_to_find` list has about 50 elements. With `timeit`, the `any(in)` solution takes 1.83 seconds, while the regex version takes 10.4. At what point will the regex be faster and more efficient? – Andrew Aug 13 '13 at 14:48
That depends on the regular expression and how the distribution of matches; `any()` exits early so if most of your matches are early in the `things_to_find` list it'll work fast, but if most of your text does *not* have a match a good regular expression pattern could be faster. – Martijn Pieters Aug 13 '13 at 14:52
@Andrew Note that 50 is not large. Thousands is large. Just as `any` can work better than worst-case when `things_to_find` is sorted in order of likelihood of a match, regex solutions will work better when the items in `things_to_find` are similar to each other. – Marcin Aug 13 '13 at 14:55
@Andrew: Your regular expression uses `.*` wildcards, introducing backtracking. They are not needed however. Remove those and the expression will already perform much better. – Martijn Pieters Aug 13 '13 at 14:58
@Andrew You're including compilation time for the regexes. That's only legitimate if you change the `things_to_find` list every time you search `my_list`. – Marcin Aug 13 '13 at 15:01
@MartijnPieters The leading `.*` is necessary if the things to find are not guaranteed to be prefixes. – Marcin Aug 13 '13 at 15:02
True. I'm still new to `timeit` :) – Andrew Aug 13 '13 at 15:04
After testing on my actual data, the `any(in)` solution works best. But @Marcin's regex thing is fantastic too. If I could accept both, I would :) – Andrew Aug 13 '13 at 15:19
@Andrew kind of you to say. – Marcin Aug 13 '13 at 15:22

score 1 · Answer 2 · answered Aug 13 '13 at 14:31

Maybe something like?:

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
things_to_find = ['abc', 'def']
for n, e in enumerate(my_list):
    for m in things_to_find:
        if m in e:
            print '%s is in %s at %s' % (m, e, n)

Output:

abc is in abc-123 at 0
def is in def-456 at 1
abc is in abc-456 at 3

score 1 · Answer 3 · answered Aug 13 '13 at 14:32

1

You are close:

matching = [i for i, x in enumerate(my_list) for keyword in things_to_find if keyword in x]

which gives [0,1,3].

You need to iterate through the things_to_find list as well, and see if the keyword is in x.

answered Aug 13 '13 at 14:32

jh314

27,144
16
62
82

-1 because `any` would be more readable, idiomatic, and efficient. – Marcin Aug 13 '13 at 14:44

score 1 · Answer 4 · answered Aug 13 '13 at 14:33

1

Might be a little slow, but why not try:

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
things_to_find = ['abc', 'def']
for thing_to_find in things_to_find:
    matching = [i for i, x in enumerate(my_list) if thing_to_find in x]

answered Aug 13 '13 at 14:33

Josh

1,032
2
12
24

1

I like how this could easily make a dict of things to occurrences with this. +1 – mr2ert Aug 13 '13 at 14:42

Marcin · Answer 5 · 2013-08-13T15:12:24.217

Build a regex, then test each list element against that:

import re
#must use search, not match because no wildcards, unless only looking for prefixes
regex = re.compile('|'.join(re.escape(interest) for interest in things_to_find))

Don't rebuild the regex every time you do the search - only rebuild when things_to_find changes.

I suspect you don't want the indices, but the elements:

[x for x in my_list if regex.search(x)]

Or, if you really do want the indices:

[i for i,x in enumerate(my_list) if regex.search(x)]

This will likely perform better than an any(in) solution (which is quadratic) for large things_to_find lists, but will be overkill for short lists. You'll also see more of a gain where the things in things_to_find are similar; and less of a gain if you can sort things_to_find such that more likely matches occur first, and if matches are likely.

You do not need to add `.*` here. Use `regex.search()` instead and avoid the backtracking. You also want to use `re.escape()` on `interest` to avoid enabling potential regular expression metacharacters in the things of interest. — Martijn Pieters, Aug 13 '13 at 15:07

score 0 · Answer 6 · answered Aug 13 '13 at 14:39

0

my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
things_to_find = ['abc', 'def']
matching = [[i for i, x in enumerate(my_list) if y in x]for y in things_to_find]

answered Aug 13 '13 at 14:39

Paul Evans

27,315
3
37
54

Find list index of list items within other list items

6 Answers6

Linked