
I've been wondering about the worst-case time complexity of telling whether two unordered arrays consist of the same elements. The elements could be of any type (numbers, strings, custom objects, etc.), but let's assume that the elements are both sortable and hashable.

I've thought of three methods, which are well explained in this Stack Overflow post: 1) use a hash, 2) use sorting, 3) just loop through.

The post said that it is possible to achieve worst-case O(n) if the data is hashable. However, I think that is not quite right, since inserting into and searching a hash table is not a worst-case O(1) operation. It is O(1) on average, when no collisions occur, but in theory both inserting and searching are O(n). So if a lot of collisions happen, using a hash to tell whether two arrays are equal will cost O(n^2). (Please correct me if I'm wrong.)
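
For instance, here is a toy Ruby sketch of the kind of pathological input I mean (the BadKey class is made up purely for illustration): every key reports the same hash value, so a Ruby Hash has to fall back to comparing against the keys already stored, and building the count table degrades toward quadratic time.

# Illustration only: a key class whose #hash is constant, so every key
# collides and each insert/lookup may have to compare against many of the
# previously stored keys.
class BadKey
  attr_reader :value

  def initialize(value)
    @value = value
  end

  def hash
    42                              # every key lands in the same bucket
  end

  def eql?(other)
    other.is_a?(BadKey) && value == other.value
  end
end

counts = Hash.new(0)
(1..1000).each { |i| counts[BadKey.new(i)] += 1 }  # far slower than with plain Integer keys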

So it seems to me that telling whether two arrays are equal costs as much as sorting the arrays, which, without any knowledge about the data, would cost O(n log n) (with the assumption that comparing two elements for equality always costs O(1)).

Is it possible to tell whether two arrays are equal in worst-case O(n)? I'd appreciate any comments, duplicate flags, or references to a paper. Thanks!

Here's my code for checking whether two arrays are equal. (It's in Ruby and works, but please treat it more like pseudocode.)

One. Compare by hashing: O(n) on average, worst-case O(n^2)

def compare_by_hashing(list1, list2)
  # Arrays of different lengths can never contain the same elements; this check
  # also makes the one-directional count comparison below sufficient.
  return false if list1.length != list2.length

  hash1 = Hash.new(0)
  list1.each { |item| hash1[item] += 1 }

  hash2 = Hash.new(0)
  list2.each { |item| hash2[item] += 1 }

  hash1.each do |key, count|
    return false if count != hash2[key]
  end
  true
end

Two. Compare by sorting: worst-case O(n log n)

def compare_by_sorting(list1, list2)
  return false if list1.length != list2.length

  # Array#sort returns a new sorted array; it does not modify the receiver.
  sorted1 = list1.sort
  sorted2 = list2.sort

  sorted1.each_with_index do |item, index|
    return false if item != sorted2[index]
  end
  true
end

Three. Compare by just looping through: worst-case O(n^2)

def compare_by_looping(list1, list2)
  return false if list1.length != list2.length

  remaining = list2.dup  # work on a copy so the caller's array is not mutated
  list1.each do |item|
    index = remaining.index(item)
    return false if index.nil?
    remaining.delete_at(index)  # remove only one occurrence, not all of them
  end
  true
end
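
A quick sanity check of the three methods, with example arrays chosen just for illustration:

a = [3, 1, 2, 2]
b = [2, 3, 2, 1]
c = [2, 3, 3, 1]

compare_by_hashing(a, b)  # => true
compare_by_sorting(a, b)  # => true
compare_by_looping(a, c)  # => false (the counts of 2 and 3 differ)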

Edit

I appreciate and understand the answers and comments saying that hash operations normally show O(1) time complexity and that worst-case scenarios are very unlikely to happen. However, since they can happen anyway, I do not want to ignore the possibility. I apologize for not making my point clear. My original intention was to find a theoretically proven O(n) algorithm, not a practical algorithm. Thanks for your attention. I really appreciate it.

Cheolho Jeon

3 Answers


Yes, you can with hashing.

You get collisions in hashing if the hash function is really bad for the data set, and you are only likely to get O(n^2) if the hash function is essentially constant (always returns 1 or something like that).

In practice you can use a cryptographic hash function and be fairly sure that you won't get too many hash collisions. This is because nobody can intentionally generate inputs that have the same, say, SHA-1 hash (many people are trying). Or alternatively try a perfect hashing algorithm.

So your worst-case analysis is based on wrong assumptions. Using good hash functions guarantees that you are always close to the average case and never in the worst case.
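
One possible way to sketch this in Ruby (an illustration only; it assumes that equal elements serialize identically under Marshal, which holds for built-in value types like numbers and strings but is not guaranteed for arbitrary custom objects):

require 'digest'

# Count elements by the SHA-256 digest of their serialized form; with a
# cryptographic hash, collisions between distinct elements are not expected.
def compare_by_digest(list1, list2)
  return false if list1.length != list2.length

  digest_counts = lambda do |list|
    list.each_with_object(Hash.new(0)) do |item, counts|
      counts[Digest::SHA256.digest(Marshal.dump(item))] += 1
    end
  end

  digest_counts.call(list1) == digest_counts.call(list2)
end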

Sorin
    "and you can be fairly sure that you don't get too many hash collisions. This is because nobody can intentionally generate inputs that have the same say SHA-1 hash (many people are trying)." What about duplicate elements. Why won't they collide? Consider [1, 1, 1, 2] and [2, 2, 2, 1]. And how do you make a perfect hash-function if you don't make *any* assumptions about your data? – dingalapadum Jun 08 '16 at 12:22
  • I agree with @dingalapadum. Although hashes are well researched, the worst-case scenario can always happen. – Cheolho Jeon Jun 08 '16 at 14:16
  • Check out https://en.wikipedia.org/wiki/Dynamic_perfect_hashing. You obviously count identical elements. What I'm claiming is that the worst case is so unlikely that you can pretend it never happens. – Sorin Jun 08 '16 at 14:42
  • Also if you want to be really pedantic you can claim that you can't compute any hash in O(1) and you can't compare values in O(1), since both depend on the number of bits you use to represent the values. – Sorin Jun 08 '16 at 14:44
  • @Sorin I understand your claim. A hash operation being O(n) is really unlikely, and that is why we use it so frequently. However, I just cannot ignore the fact that the worst case can, although very rarely, happen. As others mentioned, if we have a lot of duplicate data (not of any specific object type) in both arrays, lots of collisions will happen, leading us to the worst case. I agree with you that the practical solution would definitely be hashing. – Cheolho Jeon Jun 08 '16 at 14:56
  • @Sorin I don't quite get it: if the question is "what is the running time of blabla in the worst-case", how does an answer of the type "the worst case is unlikely, thus it will never happen" address that question? – dingalapadum Jun 08 '16 at 14:57
  • @CheolhoJeon identical values are not a problem: `hash_map[value] = x`, where x is the number of instances of that value. You increment for one side and decrement for the other. No hash collisions necessary (see the sketch after these comments). – Sorin Jun 09 '16 at 07:35
  • @Sorin Oh you are totally right. I wrote dumb things. – Cheolho Jeon Jun 10 '16 at 02:47
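
A minimal sketch of the increment/decrement counting idea from the comment above (an illustration only; it is O(n) only under the assumption that each hash operation stays O(1)):

def compare_by_counting(list1, list2)
  return false if list1.length != list2.length

  counts = Hash.new(0)
  list1.each { |item| counts[item] += 1 }  # count up for the first array
  list2.each { |item| counts[item] -= 1 }  # count down for the second

  counts.values.all?(&:zero?)              # everything cancels out iff the multisets match
end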

No, it is not possible to deterministically compare 2 arrays with a worst-case runtime of O(n) if no assumptions about the data can be made.

Your worst-case analysis for the hash tables is correct.

Why not?

Either you preprocess the arrays or you don't:

If you do preprocess, the best worst case you can get is O(n log n) (by sorting).

If you don't preprocess, you will need to compare each element of one array against each element of the other -> O(n^2).


P.S.: Unfortunately, I haven't managed to find a formal proof yet.

dingalapadum
  • Please let me know if you find a formal proof. I've been wondering about this for days but I cannot find a clear proof. – Cheolho Jeon Jun 08 '16 at 14:49
  • @CheolhoJeon I will try to see if I can come up with something. If you want to give it a try yourself, my idea is something along these lines: assume that some algorithm A can compare two arbitrary arrays in O(n) time, then show that you can use this algorithm A to speed up some other algorithm which you know can't be done that fast. (Just as an example, show that if A exists you can sort in less time than O(n log n))... I'll let you know if I can come up with something... don't know though when I'll find the time. – dingalapadum Jun 08 '16 at 15:08

The worst-case time complexity when using hashing is O(n) (assuming you have implemented the hashing correctly). The worst case here is in terms of the input (not considering a bad hashtable implementation).

What you are describing above is the case where your hashtable is implemented badly and has n collisions.

Assuming you have a good hash function which distributes your keys uniformly in the hashtable and there are no collisions, your worst-case time complexity will be O(n), since you can do the comparison in a single pass. This way it is more efficient than sorting and comparing (which needs O(n log n) time).

Rajesh Pantula
  • I disagree: consider duplicate elements. Show me a hash function which can handle all duplicate elements without having a collision. For any given hash function you can construct an input such that it will have worst-case performance of O(n) when inserting/searching. – dingalapadum Jun 08 '16 at 11:47
  • The argument here is not about whether you can practically create a perfect hash. In asymptotic analysis, you only focus on your input and determine whether your function takes more time based on input size. Whether there will be hash collisions is a different discussion and should be left out when finding the complexity for the input using asymptotic analysis. – Rajesh Pantula Jun 08 '16 at 11:51
  • It is about WORST-CASE analysis. Hash tables have a worst case of O(n) for insert, search and delete. – dingalapadum Jun 08 '16 at 11:53
  • Yes, I agree with you that hashtables do have O(n). The problem statement here is to find the worst-case complexity of the array comparison, which will be O(n). You ignore the worst-case analysis of the hashtable here. – Rajesh Pantula Jun 08 '16 at 11:55
  • This is not making any sense. Why would we ignore the worst case of the hashtable? The question is about the best worst case you can achieve when comparing two arrays. If you are using a hashtable to solve this problem, then you have to take into consideration how much overhead that will add in the worst case to your implementation... – dingalapadum Jun 08 '16 at 12:00
  • Assuming you have a hash table which gives O(1) insertion & O(1) search, the WORST-CASE complexity of this algorithm (alone, ignoring external factors) will be O(n). – Rajesh Pantula Jun 08 '16 at 12:05
  • *Assuming you have a hash table which gives O(1) insertion & O(1) search*. This is not a reasonable assumption!!! What kind of analysis is "the complexity of this algorithm alone, ignoring external factors"? If you have an algorithm A which does 1 thing, namely calling another algorithm B which runs in O(n), would you say that the algorithm A has a running time of O(1)? – dingalapadum Jun 08 '16 at 12:15
  • That's because we are determining the worst-case complexity of this algorithm alone. We do not generally include the worst-case complexity of the intermediate data structure you are using. I suggest you read about the worst-case complexities of the 2sum problem (google for 2sum problem). It is similar to this problem. It uses a single loop and stores elements in a hashtable. Going by your argument, does it have a worst-case time complexity of O(n^2) as well? – Rajesh Pantula Jun 08 '16 at 12:32
  • I googled and found different things: first of all, the data in the array of 2sum are `int`s. Next, one of the solutions using a hash-map is saying: *"This method works in O(n) time if range of numbers is known."* (src: http://www.geeksforgeeks.org/write-a-c-program-that-given-a-set-a-of-n-numbers-and-another-number-x-determines-whether-or-not-there-exist-two-elements-in-s-whose-sum-is-exactly-x/ ) Here OP is explicitly saying *no assumption about the data* can be made. I'm not saying that O(1) hashtables are never possible. I'm saying hashtables can't be O(1) *for all* possible cases... – dingalapadum Jun 08 '16 at 12:43
  • @Kundor: This is not wrong. This is how the algorithm complexities are calculated. It is easier to understand this from the code. for(int i=0;i – Rajesh Pantula Jun 09 '16 at 07:37
  • @RajeshPantula this is most definitely NOT how algorithm complexities are calculated!!! For your kind of analysis, this algorithm would always be O(n). If you seriously only take that "single loop" into account, then I must conclude that `for(int i=0;i – dingalapadum Jun 12 '16 at 17:39