13

I have a list that contains nested lists and I need to know the most efficient way to search within those nested lists.

e.g., if I have

[['a','b','c'],
['d','e','f']]

and I have to search the entire list above, what is the most efficient way to find 'd'?

fdsa
  • 1,379
  • 1
  • 12
  • 27

5 Answers5

14
>>> lis=[['a','b','c'],['d','e','f']]
>>> any('d' in x for x in lis)
True

generator expression using any

$ python -m timeit -s "lis=[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]" "any('d' in x for x in lis)" 
1000000 loops, best of 3: 1.32 usec per loop

generator expression

$ python -m timeit -s "lis=[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]" "'d' in (y for x in lis for y in x)"
100000 loops, best of 3: 1.56 usec per loop

list comprehension

$ python -m timeit -s "lis=[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]" "'d' in [y for x in lis for y in x]"
100000 loops, best of 3: 3.23 usec per loop

How about if the item is near the end, or not present at all? any is faster than the list comprehension

$ python -m timeit -s "lis=[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]"
    "'NOT THERE' in [y for x in lis for y in x]"
100000 loops, best of 3: 4.4 usec per loop

$ python -m timeit -s "lis=[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]" 
    "any('NOT THERE' in x for x in lis)"
100000 loops, best of 3: 3.06 usec per loop

Perhaps if the list is 1000 times longer? any is still faster

$ python -m timeit -s "lis=1000*[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]"
    "'NOT THERE' in [y for x in lis for y in x]"
100 loops, best of 3: 3.74 msec per loop
$ python -m timeit -s "lis=1000*[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]" 
    "any('NOT THERE' in x for x in lis)"
100 loops, best of 3: 2.48 msec per loop

We know that generators take a while to set up, so the best chance for the LC to win is a very short list

$ python -m timeit -s "lis=[['a','b','c']]"
    "any('c' in x for x in lis)"
1000000 loops, best of 3: 1.12 usec per loop
$ python -m timeit -s "lis=[['a','b','c']]"
    "'c' in [y for x in lis for y in x]"
1000000 loops, best of 3: 0.611 usec per loop

And any uses less memory too

John La Rooy
  • 295,403
  • 53
  • 369
  • 502
  • I would be cautious generalizing about this based on searching for `'d'` (ie something relatively close to the front of the list). If you look at the discussion @Ashwini and I had (below his answer) you'll see that the choice of the "target" in the list makes a significant difference. In our trials LC beat the generator with targets in the middle and at the end. Generator "won" if the target was near the front. I think in the end it's all very dependent on the specific dataset and target sought after. – Levon Aug 15 '12 at 04:53
  • @Levon, I'm well aware of the tradeoffs. However in this case, `any` is still faster than the LC even if the item is not present at all. – John La Rooy Aug 15 '12 at 04:58
  • I didn't mean to imply you weren't .. and the use of `any` didn't even occur to me until you brought it up, so that's something for me to keep in mind for future use. – Levon Aug 15 '12 at 05:01
  • @Levon, The only place the LC is faster is for a very short list. Unless you expect the list to always be short, it's better not to use a list comprehension here. – John La Rooy Aug 15 '12 at 05:10
  • 1
    @gnibbler how to identify which one of these to be used gen,LC or any()? coz In this case any() was faster and on the other link I posted gen was faster than any(). I was completely aware of any() but didn't used it here because of the results I saw on that question and this time any() came up with faster solution. – Ashwini Chaudhary Aug 15 '12 at 05:11
  • 1
    @AshwiniChaudhary, I think the only way to know for sure is to try. Also to furthur complicate things, the relative timings may change between different implementations/versions of Python. In general I won't use a list comprehension unless I need a list (for slicing, reversing, etc). generator expressions are a good 1st choice, although in some cases they can be beaten by using of itertools etc. – John La Rooy Aug 15 '12 at 05:22
7

Using list comprehension, given:

mylist = [['a','b','c'],['d','e','f']]
'd' in [j for i in mylist for j in i]

yields:

True

and this could also be done with a generator (as shown by @AshwiniChaudhary)

Update based on comment below:

Here is the same list comprehension, but using more descriptive variable names:

'd' in [elem for sublist in mylist for elem in sublist]

The looping constructs in the list comprehension part is equivalent to

for sublist in mylist:
   for elem in sublist

and generates a list that where 'd' can be tested against with the in operator.

Levon
  • 138,105
  • 33
  • 200
  • 191
  • 1
    Just so I can understand what is going on can you clarify what j and i are? – fdsa Aug 15 '12 at 03:34
  • @fdsa I'll update my answer by adding a "verbose" version (more descriptive variable names) – Levon Aug 15 '12 at 03:36
  • Thanks for taking the time, I appreciate it – fdsa Aug 15 '12 at 03:38
  • @Downvoter .. could I get an explanation please? This is a functional solution to OP's problem. A downvote ***without*** explanation helps *no one* (OP, SO, or me). If it's because I didn't use a generator, you can see from the discussion with Ashwini that the generator solution isn't always necessarily faster. – Levon Aug 15 '12 at 04:13
  • I expect the downvote was for generating the whole list every time. I expect that you'd have to have a fairly small list for the list comprehension to be faster or use less memory. – Gabe Aug 15 '12 at 05:09
  • @Gabe Perhaps, but it *is functional*, and OP didn't say there would be long lists, but instead provided a short list as their sample dataset they wanted a solution for. In any case, it's a matter of style, I *never* downvote a functional solution, though I may comment and make suggestions. Downvoter operates exactly the opposite way. – Levon Aug 15 '12 at 05:35
  • I would assume the *"most efficient"* wording is an indication that the OP actually has lots of data to search, otherwise they would have simply left those words out. – Gabe Aug 15 '12 at 05:46
4

Use a generator expression, here the whole list will not be traversed as generator generate results one by one:

>>> lis = [['a','b','c'],['d','e','f']]
>>> 'd' in (y for x in lis for y in x)
True
>>> gen = (y for x in lis for y in x)
>>> 'd' in gen
True
>>> list(gen)
['e', 'f']

~$ python -m timeit -s "lis=[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]" "'d' in (y for x in lis for y in x)"
    100000 loops, best of 3: 2.96 usec per loop

~$ python -m timeit -s "lis=[['a','b','c'],['d','e','f'],[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]" "'d' in [y for x in lis for y in x]"
    100000 loops, best of 3: 7.4 usec per loop
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • What do you mean by "the whole list will not be traversed"? .. you don't terminate early, you have to generate all of the values at some point, no? The wording seems to imply the opposite. – Levon Aug 15 '12 at 03:31
  • @Levon but it doesn't generate `e` and `f`. – Ashwini Chaudhary Aug 15 '12 at 03:36
  • 1
    So it acts like a short-circuited Boolean expression? I didn't know that .. neat. Thanks, I learned something new. – Levon Aug 15 '12 at 03:42
  • Looking at your timings, it looks that the LC beats the generator even though the generator may not traverse the whole list .. overhead to set up the generator? – Levon Aug 15 '12 at 03:48
  • @Levon yes exactly and terminates as soon as the value is found, but interestingly the `timeit` results came out better for your solution. – Ashwini Chaudhary Aug 15 '12 at 03:48
  • That's what I thought too at first, but **actually** you should run your timings again but using `-n XXX` where `XXX` are the iterations - currently they are NOT the same. – Levon Aug 15 '12 at 03:49
  • I just ran them on my system, they come out the same, 12.6 ns/loop – Levon Aug 15 '12 at 03:51
  • @Levon I created a bigger list and this time LC was much slower. – Ashwini Chaudhary Aug 15 '12 at 03:53
  • Well .. that's a bit skewed too .. you should have picked a middle value to search for (on the average we should assume the value would be equally likely to be in the front or back), not one so close to the front. Either way, interesting timing results - and also different from mine for some reason). Anyway, I've stored away what I learned about generators from your answer - thanks again. – Levon Aug 15 '12 at 03:53
  • 1
    @Levon this time I searched 11 and this time LC was much faster than generator. o.O – Ashwini Chaudhary Aug 15 '12 at 03:57
  • 1
    We got the same results this time, I searched for 10 :-) and also 18, and LC beat the generator each time, though LC was slower for 'd'. – Levon Aug 15 '12 at 04:00
2

If your arrays are always sorted as you show, so that a[i][j] <= a[i][j+1] and a[i][-1] <= a[i+1][0] (the last element of one array is always less than or equal to the first element in the next array), then you can eliminate a lot of comparisons by doing something like:

a = # your big array

previous = None
for subarray in a:
   # In this case, since the subarrays are sorted, we know it's not in
   # the current subarray, and must be in the previous one
   if a[0] > theValue:
      break
   # Otherwise, we keep track of the last array we looked at
   else:
      previous = subarray

return (theValue in previous) if previous else False

This kind of optimization is only worthwhile if you have a lot of arrays and they all have a lot of elements though.

Gordon Bailey
  • 3,881
  • 20
  • 28
  • Thank you for this - I hadn't considered sorting them but they will be used quite often so I will consider sorting – fdsa Aug 15 '12 at 03:39
  • No problem - I don't know how much data you're going to have or what your exact use-case is, but if you do have a lot of data it might be worth looking into python's [collections](http://docs.python.org/library/collections.html) module, which has high-performance data structures suited to specific tasks. – Gordon Bailey Aug 15 '12 at 03:49
  • 2
    It they are always sorted, you would be better to use the `bisect` module – John La Rooy Aug 15 '12 at 04:44
2

if you just want to know that your element is there in the list or not then you can do this by converting list to string and check it. you can extend this of more nested list . like [[1],'a','b','d',['a','b',['c',1]]] this method is helpful iff you dont know that level of nested list and want to know that is the searchable item is there or not.

    search='d'
    lis = [['a',['b'],'c'],[['d'],'e','f']]
    print(search in str(lis)) 
sahasrara62
  • 10,069
  • 3
  • 29
  • 44
  • This works beautifully! Sahasraru(1K) thanks to you. If this had been C then I would have a pointer ptr and if I did *ptr = z then the list would look like this lis = [['a', ['b'], 'c'], [['z'], 'e', 'f']]. So at this point how would you replace the element in Python? Thanks – Santhosh Kumar Mar 01 '23 at 02:09