Python large array comparison

Question

I have a large array containing URL (it can contains 100 000 URL strings), and I would like to know if my actual URL is one of the URL from the array. For that, I have to compare the actual URL string with all the URL string in the array. Is there any way to compare with this large array but with less time than I do now ? For now it's :

error = 0
for oldUrl in urlList:
    error = 1 if oldUrl == actualUrl else error

score 1 · Answer 1 · answered Oct 19 '16 at 11:46

1

To check if a list contains an item, use: item in list.

So, you can write:

error = oldUrl in urlList

answered Oct 19 '16 at 11:46

Laurent LAPORTE

21,958
6
58
103

Matt P · Accepted Answer · 2016-10-19T15:57:09.513

As already mentioned by @Laurent and @sisanared, you can use the in operator for either lists or sets to check for membership. For example:

found = x in some_list
if found: 
    #do stuff
else:
    #other stuff

However, you mentioned that speed is an issue. TL;DR -- sets are faster if the set already exists. From https://wiki.python.org/moin/TimeComplexity, checking membership using the in operator is O(n) for list and O(1) for set (like @enderland pointed out).

For 100,000 items, or for one-time-only checks it probably doesn't make much of a difference which you use, but for a larger number of items or situations where you'll be doing many checks, you should probably use a set. I did a couple of tests from the interpreter and this is what I found (Python 2.7, i3 Windows 10 64bit):

import timeit
#Case 1: Timing includes building the list/set
def build_and_check_a_list(n):
    a_list = [ '/'.join( ('http:stackoverflow.com',str(i)) ) for i in xrange(1,n+1) ]
    check = '/'.join( ('http:stackoverflow.com',str(n)) )
    found = check in a_list
    return (a_list, found)

def build_and_check_a_set(n):
    a_set = set( [ '/'.join( ('http:stackoverflow.com',str(i)) ) for i in xrange(1,n+1) ] )
    check = '/'.join( ('http:stackoverflow.com',str(n)) )
    found = check in a_set
    return (a_set, found)

timeit.timeit('a_list, found = build_and_check_a_list(100000)', 'from __main__ import build_and_check_a_list', number=50)
3.211972302022332

timeit.timeit('a_set, found = build_and_check_a_set(100000)', 'from __main__ import build_and_check_a_set', number=50)
4.5497120006930345

#Case 2: The list/set already exists (timing excludes list/set creation)
check = '/'.join( ('http:stackoverflow.com',str(100000)) )

timeit.timeit('found = check in a_list', 'from __main__ import a_list, check', number=50)
0.12173540635194513

timeit.timeit('found = check in a_set', 'from __main__ import a_set, check', number=50)
1.01052391983103e-05

For 1 million entries, to build and/or check membership on my computer:

#Case 1: list/set creation included
timeit.timeit('a_list, found = build_and_check_a_list(1000000)', 'from __main__ import build_and_check_a_list', number=50)
35.71641090788398

timeit.timeit('a_set, found = build_and_check_a_set(1000000)', 'from __main__ import build_and_check_a_set', number=50)
51.41244436103625

#Case 2: list/set already exists
check = '/'.join( ('http:stackoverflow.com',str(1000000)) )

timeit.timeit('found = check in a_list', 'from __main__ import a_list, check', number=50)
1.3113457772124093

timeit.timeit('found = check in a_set', 'from __main__ import a_set, check', number=50)
8.180430086213164e-06

score 0 · Answer 3 · edited May 23 '17 at 12:00

Don't use a list for this. Lookups in lists have a worst case complexity of O(n).

Use a set (or dictionary if you have other metadata) instead. This has a lookup of roughly O(1). See here for comparisons between a set, dictionary, and list.

Using a set, the lookup is simple:

urls = set(['url1', 'url2', 'url3'])

print ('url2' in urls)
print ('foobar' in urls)

Or in your case, convert your list object as a set:

urlListSet = set(urlList)
print(oldUrl in urlListSet)

You can also add new urls to your set:

urlListSet.add(newurl)
urlListSet.update(listOfNewUrls)

Python large array comparison

3 Answers3