Finding unique sequences in large arrays

Question

For a given bit array as shown below (x is 1, . is 0) we have an existing algorithm to determine that each row is unique:

XXX...
X.XXXX
XX..X.
.XX...

However, performance suffers if the number of rows is very large. Can anyone suggest algorithms that will perform better to find these unique row sequences if the array has 1 million rows?

In other words, these rows:

XXX...
X.XXXX
XX..X.
.XX...
XXX...
X.XXXX
XX..X.
.XX...
XXX...
X.XXXX
XX..X.
.XX...

will always have the sequences shown in the first array no matter how big the array height ?

It should be fairly simple IIRC: take any row.Say it appears at position 4 and repeats at position 55 for the first time after 4. Check if the rows in positions 5 and 56 are identical and so on until you get to the same row as 4. Repeat this until the end of the array. — user1952500, Jul 09 '15 at 21:01
You can do it using linq `group by`. Also maybe you could include how you define you array is a `string[]` or `char[][]` that help us a litle bit understand what you have instead of guess. — Juan Carlos Oropeza, Jul 09 '15 at 21:09
@user1952500 this is very simple and completely unefficient I am looking for a good algorithm in this regard — hassan alrehamy, Jul 09 '15 at 21:09
@JuanCarlosOropeza the question is very clear and its very obvious that this is a bitarray representation — hassan alrehamy, Jul 09 '15 at 21:11
Maybe is very obvious for you. But you ask for C# so at least you should post the C# code you already try so we can help you more easy. Also what is efficient for you? Tell us something about context. I think group by will tell you how many time each sequence appear on the array. — Juan Carlos Oropeza, Jul 09 '15 at 21:20
For arrays that are six elements wide where each element can only have two possible values, there are only 64 possible options for the array. If you have 1Million of them, there's no way they can all be unique. — Joel Coehoorn, Jul 09 '15 at 21:27
@hassanalrehamy, unfortunately you have to traverse the whole set of sequences at least once and it can be proven that the inefficient way is the best way. For example, suppose you don't look at the last record. The last record may be the one which is not conforming to the structure above. — user1952500, Jul 09 '15 at 21:27
And, no, the question is not clear at all. The reference to the "first array" in the last sentence lacks sufficient context. We're missing part of the problem. — Joel Coehoorn, Jul 09 '15 at 21:29
First you can use an brute force algorithm to find if sequence appear at least twice. Take a look to related questions http://stackoverflow.com/questions/3580987/fast-algorithms-for-finding-unique-sets-in-two-very-long-sequences-of-text and http://stackoverflow.com/questions/5391264/best-way-to-find-out-if-ienumerable-has-unique-values. — Mihai8, Jul 09 '15 at 21:36
@JoelCoehoorn I assume that the width of each row (the number of bits per number) is not constant at 6, because if so, you're right, this question would be trivial to solve (`return false;`). :) — Jashaszun, Jul 09 '15 at 21:58
@TimFreese Yes, they only have `X` and `.`. As the question states, `X` is `1` and `.` is `0`, and they are bit arrays. — Jashaszun, Jul 09 '15 at 21:59
@Jashaszun This is a lot more clear now than it was when TimFreese first commented. — Joel Coehoorn, Jul 09 '15 at 23:29

score 0 · Answer 1 · answered Jul 09 '15 at 21:39

Assuming the bit arrays aren't too long, here is what I would do:

Convert each bit array into a number.
Create a HashSet<int>.
Put each number into the hash set. If Add returns false the number wasn't unique

Now, you can either remove that number from hash set or have a second set with the non-unique numbers. Which way is better depends on how many rows are duplicate in general.

Finding unique sequences in large arrays

1 Answers1