3

I have the following Python code, where a and b are lists (I know this isn't the best way of getting intersections):

def get_intersection(a, b):
    a = set(a)
    b = set(b)
    c = sorted(list(a&b))
    return c
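For example, a quick sanity check with made-up inputs (just to show the behaviour):

```python
def get_intersection(a, b):
    a = set(a)
    b = set(b)
    c = sorted(list(a & b))
    return c

print(get_intersection([3, 1, 2, 3], [2, 3, 4]))  # → [2, 3]
```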

Let's call len(a) m and len(b) n, where there is no additional information about a and b. Then the time complexity of the given code is O(m) + O(n) + O(m * n) + O((m + n) * log(m + n)).

I can definitely drop O(m) and O(n), because they are dominated by O(m * n), but what should I do with O((m + n) * log(m + n))?

How do I compare O(m * n) and O((m + n) * log(m + n))? Should I keep O((m + n) * log(m + n)) in the final evaluation?

vszholobov
  • Asymptotically O(mn) is dominant for sure. Someone with fresh calculus is invited to prove it :) – Eugene Sh. Jul 23 '21 at 13:28
  • `m` and `n` are not separate components. Your combined input size is just the cumulative size of both arguments. – chepner Jul 23 '21 at 13:29
  • `a & b` wouldn't be O(m*n) anyway; it's O(m + n) as well, because set lookup is O(1). – chepner Jul 23 '21 at 13:30
  • `c = sorted(a&b)` would also be sufficient; `sorted` returns a list, but can sort any iterable. – chepner Jul 23 '21 at 13:30
  • @chepner according to python wiki worst case of sets intersection is O(m * n): https://wiki.python.org/moin/TimeComplexity – vszholobov Jul 23 '21 at 13:32
  • It's a very odd worst-case. Raymond Hettinger discusses this in an [answer](https://stackoverflow.com/a/8102536/1126841) and a comment on that answer. In short, it only happens if you have data that uses a laughably bad hash function in the first place. – chepner Jul 23 '21 at 13:40
  • For example, there is no worst case for sets of integers that would achieve O(n^2) behavior. It's entirely dependent on the type of value *in* the set, not the set itself. – chepner Jul 23 '21 at 13:49
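To make chepner's point concrete, here is a sketch (illustrative only, not from the discussion above) of the kind of pathological type that triggers the quadratic worst case:

```python
class BadHash:
    """Every instance hashes to the same bucket, so set lookups
    degrade from O(1) to linear and intersection to O(m * n)."""

    def __init__(self, v):
        self.v = v

    def __hash__(self):
        return 0  # deliberately terrible: all values collide

    def __eq__(self, other):
        return self.v == other.v


a = {BadHash(i) for i in range(1000)}
b = {BadHash(i) for i in range(500, 1500)}
common = {x.v for x in a & b}  # still correct, just slow
```

The intersection still produces the right answer; it is only the running time that degrades, because every membership test has to scan one long collision chain.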

2 Answers

1

You can treat the total input size as n; it doesn't really matter which argument contributes what to that total. (The two extremes are when one or the other argument is empty; moving items from one argument to the other doesn't change the overall amount of work you'll be doing.)

As such, both set(a) and set(b) are O(n) operations.

a & b is also O(n); you don't need to compare every element of a to every element of b to compute the intersection, because sets are hash-based. You basically just make O(n) constant-time lookups. (I am ignoring the horrific corner case that makes set lookup linear. If your data has a hash function that doesn't map every item to the same value, you won't hit the worst case.)

sorted(a&b) takes O(n lg n). (There's no need to create a list first; doing so would only add another O(n) step anyway.)

Because each of the preceding operations is performed in sequence, the total complexity of get_intersection is O(n lg n).
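Putting that together, the whole function can be written as one expression with the same behaviour as the original, just without the intermediate list:

```python
def get_intersection(a, b):
    # set(...) is O(n), & is expected O(n), sorted is O(n lg n)
    return sorted(set(a) & set(b))
```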

chepner
  • "You can treat the total input size as n": no, this is wrong. m and n are independent numbers, and so are m+n and mn. –  Jul 23 '21 at 15:08
  • But there's no *reason* to treat the two list sizes separately. Neither `m` nor `n` ever appears apart in the overall complexity, only in the combination `m + n`, so it doesn't make any difference whether one is larger than the other. Cf. a graph algorithm that's O(n lg n + m) and scales differently depending on whether the graph is sparse (m = O(n)) or dense (m = O(n^2)). – chepner Jul 23 '21 at 15:53
  • I've already dispensed with any need to consider the product `mn`. – chepner Jul 23 '21 at 15:54
  • "I've already dispensed with any need to consider the product mn": this is wrong. –  Jul 23 '21 at 21:40
  • I'm ignoring that, because it's not a realistic concern. Set intersection is only quadratic if you do something stupid and put data that can't be properly hashed in the set. – chepner Jul 23 '21 at 21:41
  • This is wrong. Asymptotic complexity is asymptotic complexity. You may not cherry pick the terms you like. And even less disregard the expression given to you. –  Jul 23 '21 at 21:43
  • I'm taking a more practical stance: this is specifically a Python question, not an algorithm on a theoretical Turing machine. If there is no point in using a `set` to improve performance for a data type that can't take advantage of the `set`, I'm going to ignore it. – chepner Jul 24 '21 at 13:51
  • Don't be so bad faith. The question is about asymptotic complexity using the big-O notation, you know it very well. And please don't try to discredit me by random references to the theory of algorithms. –  Jul 24 '21 at 14:14
0

I don't think that you can simplify the expression.

Indeed, if you set m to a constant value, say 5, you have a complexity

5n + (n+5)log(n+5) = O(n log(n))

and the first term is absorbed.

But if you set m = n,

n² + 2n log(2n) = O(n²)

and this time the second term is absorbed. Hence all you can write is

O(mn + (m+n) log(m+n)).

With a change of variable such that s is the sum and p the product,

O(p + s log(s)).
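A quick numeric check of the two regimes (the particular sizes here are arbitrary, chosen only to show which term wins):

```python
import math


def product_term(m, n):
    return m * n


def sort_term(m, n):
    return (m + n) * math.log2(m + n)


# m fixed at 5: the (m+n) log(m+n) term dominates
print(product_term(5, 1_000_000) < sort_term(5, 1_000_000))  # True

# m = n: the m*n term dominates
print(product_term(1_000, 1_000) > sort_term(1_000, 1_000))  # True
```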
  • Your answer is right for an algorithm with complexity O(m * n) + O((m + n) * log(m + n)), but I incorrectly assumed that the complexity of intersection in Python is O(m * n). In fact it is O(m + n), as chepner pointed out in the comment section. So for my code the final complexity is O(m + n) + O((m + n) * log(m + n)) = O((m + n) * log(m + n)) – vszholobov Jul 23 '21 at 15:55