
My understanding of list and set in Python is mainly that a list allows duplicates, keeps its elements in order, and supports indexing by position. I found that when searching whether an element is within a list, the runtime is much faster if I convert the list to a set first. For example, I wrote some code to find the longest consecutive sequence in a list. Using list(range(10000)) (i.e. 0 to 9999) as an example, the longest consecutive sequence has length 10000. Using a list:

    from datetime import datetime

    start_time = datetime.now()
    nums = list(range(10000))
    longest = 0
    for number in nums:
        if number - 1 not in nums:
            length = 0
            # count how long the consecutive run starting at `number` is
            while number + length in nums:
                length += 1
            longest = max(length, longest)
    end_time = datetime.now()
    timecost = 'Duration: {}'.format(end_time - start_time)
    print(timecost)

The run time for the above code is "Duration: 0:00:01.481939". After adding only one line to convert the list to a set (the third line below):

    from datetime import datetime

    start_time = datetime.now()
    nums = list(range(10000))
    nums = set(nums)
    longest = 0
    for number in nums:
        if number - 1 not in nums:
            length = 0
            # count how long the consecutive run starting at `number` is (set lookup now)
            while number + length in nums:
                length += 1
            longest = max(length, longest)
    end_time = datetime.now()
    timecost = 'Duration: {}'.format(end_time - start_time)
    print(timecost)

The run time for the above code using a set is now "Duration: 0:00:00.005138", many times shorter than searching through a list. Could anyone help me understand the reason for that? Thank you!

Joey
  • Yes, the computational complexity of searching for an item in a set is O(1), i.e. it's approximately constant time and doesn't depend on the size of the set. On the other hand, searching for an item in a list is O(n) on average, so the time required grows linearly with the size of the list. The main reason, I believe, is that sets are hashed, similar to how searching through the keys of a dictionary is also O(1). – NotAName May 05 '22 at 00:56
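A minimal sketch (assuming Python 3 and the standard-library `timeit` module) that times only the membership test, without the surrounding loop; the exact numbers are machine-dependent, and only the relative difference matters:

    import timeit

    nums_list = list(range(10000))
    nums_set = set(nums_list)

    # Worst case for the list: the value sits at the very end.
    print(timeit.timeit('9999 in nums_list', globals=globals(), number=1000))
    print(timeit.timeit('9999 in nums_set', globals=globals(), number=1000))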

1 Answer


This is a great question.

The issue with arrays is that there is no smarter way to search some array `a` besides comparing every element one by one.

  • Sometimes you'll get lucky and get a match on the first element of a.
  • Sometimes you'll get unlucky and not find a match until the last element of a, or perhaps none at all.
  • On average, you'll have to search half the elements of the array each time.

This is said to have a "time complexity" of O(len(a)), or colloquially, O(n). This means the time taken by the algorithm (searching for a value in an array) is linearly proportional to the size of the input (the number of elements in the array to be searched). This is why it's called "linear search". Oh, your array got 2x bigger? Well your searches just got 2x slower. 1000x bigger? 1000x slower.
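As a rough sketch (not CPython's actual implementation of `in`), a linear search boils down to something like this:

    # Simplified sketch of linear search: compare against each element
    # until a match is found or the array is exhausted.
    def linear_search(a, value):
        for element in a:      # up to len(a) comparisons
            if element == value:
                return True
        return False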

Arrays are great, but they're slow to search if the element count gets too high.

Sets are clever. In Python, they're implemented as if they were a dictionary with keys and no values. Like dictionaries, they're backed by a data structure called a hash table.

A hash table uses the hash of a value as a quick way to get a "summary" of an object. This "summary" is then used to narrow down its search, so it only needs to linearly search a very small subset of all the elements. Searching in a hash table has a time complexity of O(1). Notice that there's no "n" or len(the_set) in there. That's because the time taken to search in a hash table does not grow as the size of the hash table grows. So it's said to have constant time complexity.
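Here's a toy sketch of that bucketing idea (real CPython sets use open addressing rather than per-bucket lists, so treat this purely as an illustration of how the hash narrows the search):

    NUM_BUCKETS = 8
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def toy_add(value):
        # The hash decides which bucket a value lives in.
        buckets[hash(value) % NUM_BUCKETS].append(value)

    def toy_contains(value):
        # Only one small bucket is searched, however many values were added.
        return value in buckets[hash(value) % NUM_BUCKETS]

    for n in range(100):
        toy_add(n)
    print(toy_contains(42), toy_contains(1000))  # True False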

By analogy, you only search the dairy aisle when you're looking for milk. You know the hash value of milk (say, its aisle) is "dairy" and not "deli", so you don't have to waste any time searching the deli for milk.

A natural follow-up question is "then why don't we always use sets?". Well, there's a trade-off.

  • As you mentioned, sets can't contain duplicates, so if you want to store two of something, they're a non-starter.
  • Sets are also unordered, so if you care about the order of elements, they won't do, either.
  • Sets generally have a larger CPU/memory overhead, which adds up when using many sets containing small numbers of elements.
  • Also, it's possible that because of fancy CPU features (like CPU caches and branch prediction), linearly searching through small arrays can actually be faster than the hash-based look-up in sets; see the rough check after this list.
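A rough, machine-dependent way to check the last two points (a sketch, not a proper benchmark; the variable names here are made up for illustration):

    import sys
    import timeit

    small_list = [3, 1, 4, 1, 5]
    small_set = set(small_list)

    # Memory overhead: the set is typically larger than the equivalent list.
    print(sys.getsizeof(small_list), sys.getsizeof(small_set))

    # On tiny collections, a linear scan can rival the hash-based lookup.
    print(timeit.timeit('5 in small_list', globals=globals(), number=1_000_000))
    print(timeit.timeit('5 in small_set', globals=globals(), number=1_000_000))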

I'd recommend you do some further reading on data structures and algorithms. This stuff is quite language-independent. Now that you know that set and dict use a hash table behind the scenes, you can look up resources that cover hash tables in any language, and that should help. There are also some Python-centric resources, like https://www.interviewcake.com/concept/python/hash-map

Alexander
  • I wouldn't call "O(N)" a colloquialism. It's a proper way of expressing how complexity behaves at infinity; otherwise the complexity of searching for an element in a list would be *O(len(a) / 2)* on average. https://en.wikipedia.org/wiki/Big_O_notation – NotAName May 05 '22 at 01:17
  • @pavel I would. The hand-wavy nature of how “n” is defined makes people prone to accidentally miscommunicating about time complexities, or rolling up disparate variables into it. For example, when searching a square matrix for a value, would that be `O(n)` (where n is the total number of elements), or `O(n^2)` (where n is the height and width)? Both are valid, but neither is more obviously correct. Being less hand-wavy and more explicit, like `O(w*h)`, totally solves that problem. – Alexander May 05 '22 at 01:58
  • Though “n” is useful in generalizing complexity classes or comparing them in abstract, e.g. for questions like “what’s slower, `O(n!)` or `O(2^n)`?”. I maintain that it’s not a good way to talk about one specific algorithm or another. Btw `O(len(a) / 2)` and `O(len(a))` are equivalent. The constant factor of 2 gets cancelled out when you expand the definition and compute the limit. – Alexander May 05 '22 at 01:59
  • I think that for the matrices the standard way of measuring complexity is O(n*m) (which becomes O(n^2) in the case of a square matrix) for the simple reason that programmatically searching an element in a matrix is normally represented by nested for-loops. Here O(n) would be just wrong since there's usually no algorithm you can implement that will scale based on just one variable in this case. – NotAName May 05 '22 at 02:34
  • I liked your answer until I got to the analogy, because it seems so, well…juvenile. – martineau May 05 '22 at 02:36
  • @pavel you illustrate my point by saying “based on one variable”. People can reasonably disagree about what we’re calling a variable. The correct time complexity is dependent on how you define your variable. If you define “n” as “the number of elements in a matrix”, then linear search in a matrix is `O(n)`. That may sound contrived, but most matrix math libraries lay out the elements of multi-dimensional matrices using one single contiguous array of elements. – Alexander May 05 '22 at 03:07
  • @martineau I think it’s important that I don’t take myself too seriously. I’m glad you enjoy it. – Alexander May 05 '22 at 03:07
  • But as far as computer algorithms are concerned, O(n) in the case of a matrix is meaningless. Matrices have dual indexing so in order to iterate over elements of a matrix you need to iterate over both indices. Maybe there are languages where it is not so but in a general case this is true. And since in this context we necessarily imply that big-O notation represents a complexity of an algorithm, you can't really use O(n) (since you can't simply access nth element of the matrix) and you can't use it to justify your position. – NotAName May 05 '22 at 03:13
  • @pavel please look up any popular matrix library and see how they implement pointwise addition or addition of a scalar. You won’t find a single nested loop. “ Matrices have dual indexing so in order to iterate over elements of a matrix you need to iterate over both indices.” that’s true at the user api level, but it’s absolutely not true of the underlying data structure. A well implemented 2d matrix uses a single buffer that stores all rows (or cols, if using column major order) end on end, which can *absolutely* be indexed by a single index, and it is. – Alexander May 05 '22 at 13:30
  • To clarify, I mean a generalized multidimensional matrix library, like the data structures that back R or Matlab, and not some 2D specific matrix implementation – Alexander May 05 '22 at 13:38
  • Agree, but as I understand it, even though the values are stored as a contiguous array in memory, correct access specifically for matrix operations still requires internal calculation of the respective indices. – NotAName May 06 '22 at 02:53
  • @pavel It doesn't. Imagine you have a 3x3 2D matrix, and you're adding it to another 3x3 2D matrix. You *could* implement it with 2 nested `for` loops. But that would be needlessly non-generic (what if I want a 1D matrix, 3D matrix, or 100,000D matrix?). You could go to the effort of having `(i, j)` equal to `(0, 0)`, `(0, 1)`, `(0, 2)`, then `(1, 0)` ... `(2, 2)`. But why bother? When it comes to finding the memory addr of each element, you end up doing `index = base_pointer + i * matrix_width + j`... it boils down to 1 number in the end, from `0` to `8`. – Alexander May 06 '22 at 03:07
  • @pavel There's no point maintaining 2 loop counters, only to end up flattening them down using index math into a single index into your buffer. You're better off just looping one-dimensionally and doing your element-wise addition as if you had two 9-element 1D arrays. – Alexander May 06 '22 at 03:08
  • I didn't know these details. Good discussion, thank you! – NotAName May 06 '22 at 05:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/244544/discussion-between-alexander-and-pavel). – Alexander May 06 '22 at 13:05
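A minimal sketch of the row-major layout described in the comments above: a 3x3 matrix stored as one flat buffer and addressed by a single computed index (the names here are made up for illustration):

    # A 3x3 matrix stored row-major in one flat 9-element buffer.
    width = 3
    buffer = [10, 20, 30,
              40, 50, 60,
              70, 80, 90]

    def element_at(i, j):
        # (i, j) collapses to a single offset into the buffer.
        return buffer[i * width + j]

    print(element_at(2, 1))  # 80

    # Element-wise addition never needs the (i, j) pair at all:
    other = [1] * len(buffer)
    result = [a + b for a, b in zip(buffer, other)]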