
Input:

    0 1 0 0 1
    1 0 1 1 0
    0 1 0 0 1
    1 1 1 0 0

Expected output:

    0 1 0 0 1
    1 0 1 1 0
    1 1 1 0 0

The solution I could come up with was:

  1. Convert each row into a decimal number (or use some checksum method); this takes O(n) per row.
  2. This essentially converts the matrix into a one-dimensional array.
  3. Now use a hash table and scan through all the elements of this array.
  4. Keep track of duplicates and report only the unique elements, as sketched below.
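
In Python, the idea looks roughly like this (a minimal sketch; `dedupe_rows` is just an illustrative name):

```python
def dedupe_rows(matrix):
    seen = set()          # the hash table from step 3
    unique_rows = []
    for row in matrix:
        # Step 1: pack the row's bits into a single integer key,
        # i.e. read the row as a binary number.
        key = 0
        for bit in row:
            key = (key << 1) | bit
        # Steps 3-4: report a row only the first time its key appears.
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows

matrix = [[0, 1, 0, 0, 1],
          [1, 0, 1, 1, 0],
          [0, 1, 0, 0, 1],
          [1, 1, 1, 0, 0]]
print(dedupe_rows(matrix))
# [[0, 1, 0, 0, 1], [1, 0, 1, 1, 0], [1, 1, 1, 0, 0]]
```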

Other solutions may include using a trie (or a similar structure), but that would still take O(n^2). For comparison, a trie version might look like the sketch below.
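
Here is a minimal, dict-based rendering of the trie idea (`dedupe_rows_trie` is just an illustrative name). Since all rows have the same length, a row is a duplicate exactly when its whole path already exists in the trie:

```python
def dedupe_rows_trie(matrix):
    root = {}             # each trie node maps a bit (0/1) to a child
    unique_rows = []
    for row in matrix:
        node, is_new = root, False
        for bit in row:
            if bit not in node:
                node[bit] = {}
                is_new = True
            node = node[bit]
        if is_new:
            unique_rows.append(row)
    # Total work touches every cell once: O(n^2) for an n x n matrix.
    return unique_rows
```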

Is there a better solution?

Karthik
  • There are `n^2` cells in the matrix, and you need to read all of them, so any possible algorithm would be at least `O(n^2)`. Well, depending on what `n` is. – Rosh Oxymoron Jun 13 '11 at 18:27
  • Assuming n is the dimension of the matrix (i.e. n×n), you will never be able to achieve less than O(n^2), as that is the time required to scan the whole thing. – phkahler Jun 13 '11 at 18:29
  • @Rosh, phkahler - What you are calling N^2 items is actually just N. The number of rows and columns doesn't matter when calculating Big-O; the total number of elements does. This could just as easily be a one-dimensional array of N elements from which you are instructed to take the elements X at a time, and your Big-O notation shouldn't change, but the way you're counting it would. – KeithS Jun 13 '11 at 18:38
  • If you convert the matrix into an integer array, this [answer](http://stackoverflow.com/questions/1532819/algorithm-efficient-way-to-remove-duplicate-integers-from-an-array/1533242#1533242) could be of interest. – Christian Ammer Jun 13 '11 at 19:21

2 Answers


You can do it in linear time by computing a hash of each row, BucketSorting the hashes (fastest integer sort ever devised), and then removing duplicates from the sorted list (for each row, you compare the current row's hash with the previous one, and if it matches, delete the current row).

EDIT: Because everyone got downvoted, apparently by someone who doesn't understand that iterating N items is linear regardless of how they're arranged, I will elaborate.

Big-O calculation does not take into account how a collection is arranged in memory, UNLESS the storage mechanism doesn't allow for effectively constant retrieval time. Arrays, no matter how many dimensions, are considered effectively constant to retrieve from. So, we should consider going through the entire 5x5 matrix as a linear operation, because it essentially performs the same as if you'd been given a one-dimensional array of 25 objects.

With that out of the way:

  • Hashing all elements, taken five at a time, is linear, because we need to read every element exactly once to add them to that row's hash (which can be as simple as multiplying each element by 10^x or 2^x and adding to a running total).

  • The BucketSort algorithm executes in X*M time for a one-dimensional array of X elements with maximum order of magnitude M. As X in this case is the square root of our total N for the overall operation, and the worst-case maximum order of magnitude M would also be the square root of N, our BucketSort will execute in O(X*M) ~= O(N) worst-case.

  • Iterating through the sorted hashes is linear, on the order of the square root of our total N.

So, the total complexity of this algorithm, executed on a matrix of N values, is roughly O(2N + sqrt(N)), which is considered O(N).
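
A rough Python rendering of these three steps (not the answer's own code; the one-bucket-per-hash-value table assumes rows are narrow enough that 2^m buckets fit in memory, matching the bounded-magnitude assumption above):

```python
def dedupe_rows_bucketsort(matrix):
    m = len(matrix[0])

    # Step 1: hash each row by packing its bits into an integer;
    # every cell is read exactly once, so this is linear in N.
    hashes = [(sum(bit << (m - 1 - j) for j, bit in enumerate(row)), i)
              for i, row in enumerate(matrix)]

    # Step 2: bucket sort the hashes, one bucket per possible value.
    buckets = [[] for _ in range(1 << m)]
    for h, i in hashes:
        buckets[h].append(i)

    # Step 3: walk the sorted hashes; within a bucket, every row after
    # the first matches its predecessor, so only the first is kept.
    return [matrix[indices[0]] for indices in buckets if indices]
```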

KeithS
  • The OP wants a faster solution, not a slower one. – JustSid Jun 13 '11 at 18:53
  • I don't know why you think that this is linear time... This solution is O(n+n+k+n) best case and O(n^2+n^2+n) worst case, not O(n). – Skyler Saleh Jun 13 '11 at 18:58
  • @JustSid: Why do you think the solution is slower? Compute hash: O(n), BucketSort: O(n) (if the input array is normally distributed), remove duplicates: O(n). Gives O(3n), or am I wrong? – Christian Ammer Jun 13 '11 at 19:05
  • @JustSid - and I fail to see how you think this is slower. I went through the algorithm step-by-step and the most significant term is linear, so the algorithm is linear. – KeithS Jun 13 '11 at 19:10

Why don't you store the binary values inside an integer (like you would a bitfield), then sort the integers using either quicksort or merge sort? Then iterate through the sorted list checking for duplicates; duplicate values will always be directly next to each other since the list is sorted. This would take O(n log n + n), where n is the number of rows in your matrix. However, each operation would be incredibly fast because it would be made up of comparisons, swaps, and equality checks on integers, which are very fast on modern hardware. A sketch follows below.
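
Sketched in Python (with the simplifying assumption that Python's arbitrary-precision ints stand in for the fixed-width bitfields discussed in the comments; `dedupe_rows_sorted` is just an illustrative name):

```python
def dedupe_rows_sorted(matrix):
    # Pack each row into an integer, remembering its original index.
    packed = []
    for i, row in enumerate(matrix):
        value = 0
        for bit in row:
            value = (value << 1) | bit
        packed.append((value, i))

    packed.sort()  # O(n log n) comparison sort on cheap int compares

    # Duplicates are now adjacent: keep only the first of each run.
    keep = [i for k, (value, i) in enumerate(packed)
            if k == 0 or value != packed[k - 1][0]]
    return [matrix[i] for i in sorted(keep)]  # restore input order
```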

Skyler Saleh
  • But to get them into an integer, you need to go through an entire row (O(n)) times the number of rows -> O(n^2). Or am I wrong? – iliaden Jun 13 '11 at 18:33
  • No, because you will be storing the values originally in an integer similar to a bitfield. You will have to store the data somewhere to begin with, so why not in an easy to process format ;) – Skyler Saleh Jun 13 '11 at 18:34
  • @iliaden - See my comment on the OP. N should be considered as the total number of elements in the input, not the cardinality of just one dimension. If iterating a one-dimensional array of 25 elements is linear (which it is, by definition), iterating a 5x5 matrix containing 25 elements is also linear and so will execute in roughly equal time. – KeithS Jun 13 '11 at 18:42
  • @KeithS fine, the time for this solution would be O(ceil(n/x)^ceil(n/x)*log(ceil(n/x)^ceil(n/x))+ceil(n/x)), where n is the number of elements and x is the number of bits in the largest integer type on the system. BTW, I was telling him an alternative way to store the elements so the algorithm would operate on a smaller n to begin with, since the algorithm would be removing duplicate integers instead of duplicate rows, because of the properties of bitfields. – Skyler Saleh Jun 13 '11 at 18:48
  • @KeithS - Absolutely. I got n jumbled with N. Yes there can't be a faster solution, because we need to at least read all the N(=n*n) elements. – Karthik Jun 13 '11 at 18:59
  • @RTS - Sure, this looks like a better way to store the binary array for processing, although there might be a few downsides. We would be restricting ourselves to the number of bits available for int storage. Also, do we really need to sort? Won't hashing suffice? – Karthik Jun 13 '11 at 19:06
  • You guys don't get it. "n" should be equal to "N", and both of them should be the total number of elements in the collection (25), NOT the number of rows or columns (5). This is why it's an interview question; it stumps people who think of N incorrectly. Iterating through N elements in an array is O(N), and it doesn't matter how many dimensions the array has. The N elements are the input cardinality on which you calculate Big-O, and it is possible to perform the requested operation on them in time linear to the input cardinality. – KeithS Jun 13 '11 at 19:06
  • @KeithS, so an interviewer wants you to choose the algorithm that looks faster on an imaginary scale rather than one that works faster in practice? I could have sworn companies want results, not trivia. The interviewer said give me the fastest algorithm. – Skyler Saleh Jun 13 '11 at 19:10
  • @karthik you can eliminate the downside of the lack of bits by using more ints; the second O algorithm I showed had this factored into it. – Skyler Saleh Jun 13 '11 at 19:12
  • You don't even realize that what I'm doing and what you're doing are the same damn thing; I'm just using a linear sort while you're using a NlogN sort. By combining the bits into an integer number, you are performing a hash, which takes linear time because you have to iterate through the bits and add them at the proper order of magnitude to a running total which is your bitfield. Then you sort, then you dedupe the sorted values. – KeithS Jun 13 '11 at 19:15
  • You're either thinking the conversion to a bitfield is free (it isn't; at some level you're paying for it), or that the algorithm should be timed based only on rows. Even if it should be, then in that case it's wrong to consider iterating through columns as contributing to the overall complexity, because rows have become your "element" and columns are insignificant to the overall complexity of the operation. You can't have it both ways. – KeithS Jun 13 '11 at 19:18
  • @KeithS almost universally in published literature considering complexity of matrix operations, N is the greater of the number of rows or columns not the total number of elements. For example, complexity of matrix multiplication is written O(N^3) and Strassen multiplication is considered to be O(N^2.8) or so, not O(N^1.5) and O(N^1.4) as your convention would have it. Your algorithm is similar to the OP, who writes it as O(N^2), following the hegemonic convention. – Pete Kirkham Jun 14 '11 at 18:51