2

So I have a list of string which looks something like that -

['https://images.website.com/images/data/X/source1',
'https://images.website.com/articles/data/main_data/source2',
'https://images.website.com/harmony/data/Y/source3',
'https://images.website.com/files/data/Z/source4',
'https://images.website.com/pictures/data/T/source5']

I need to find the item which has main_data in it.

I thought of something like.


def func():
    for item in list:
        if item.index('/main_data/'):   # not -1 
            return item

return -1  # item not found

Is it the fastest way when I have a list of 100-1000 and maybe more items?

Paz Bazak
  • 63
  • 7
  • You have to at least check every string, so yes, O(N) is the fastest. – Aplet123 Nov 25 '20 at 17:04
  • 1
    But you can parallelize the process, which is only worth it for massive amounts of data, though. (Also don't name your variables after built-ins) – user8408080 Nov 25 '20 at 17:05
  • if you use that list of string several times, you can filter/sorted it first and/or save it into a dictionary in some useful way for example, and thus reducing your search space... – Copperfield Nov 25 '20 at 17:12
  • 1
    Just use `"/main_data/" in item` instead of calling `index`... – Tomerikoo Nov 25 '20 at 17:14
  • 2
    Does this answer your question? [Check if a Python list item contains a string inside another string](https://stackoverflow.com/questions/4843158/check-if-a-python-list-item-contains-a-string-inside-another-string) – Tomerikoo Nov 25 '20 at 17:15

3 Answers3

2

If you only have 100-1000 items as you mentioned, I would imagine that the method you already posted for us completes instantly, so it's hard to imagine perceivable gains from speeding things up. However, if you have many more items, it is usually faster to use a builtin function over a for loop if possible.

def func(l):
    return next(filter(lambda s: '/main_data/' in s, l))
    # raises StopIteration if no string contains /main_data/
mCoding
  • 4,059
  • 1
  • 5
  • 11
  • 1
    you can give `next` a default value to return instead of raising an exception, for example `next(iter([]),42)` return 42 – Copperfield Nov 25 '20 at 17:16
  • 1
    @Copperfield, yes you can. However, in this case I would not recommend doing that. The function purports to return the string itself, not the index, so returning -1 doesn't seem sensible (although that is what the original post did). In my opinion, it makes more sense to raise an exception and force the caller to handle it. – mCoding Nov 25 '20 at 17:18
2

There are many points to be made:

  • If performance is your concern, don't use Python for that bit.
  • If you can use several cores, do so. The task is embarrassingly parallel, although you likely need to keep the threads in a pool, because there are so few items.
  • If you roughly know where the substring is within a string, then you can avoid traversing the entire string. Same for guessing at which position the string usually is within the list.
  • If you have information about some property that allows you to cut the search down, use it.
  • If you need to search several times, perhaps for different terms, you are likely able to build a better data structure or an index rather than performing the naive search every time.
Acorn
  • 24,970
  • 5
  • 40
  • 69
1

Your function is fastest

def func_1(path_list):
     for item in path_list:
         if '/main_data/' in item:
             return item
     return -1  

For example, for 100 elements, the time processing is 0:00:00.000048 and thats is good.

path_list = 99*['https://images.website.com/images/data/X/source1']
path_list.append('https://images.website.com/articles/data/main_data/source2')

from datetime import datetime

now = datetime.now()

func_1(path_list)

end = datetime.now()
print(end-now)

0:00:00.000048

gañañufla
  • 552
  • 2
  • 12