6

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.

Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).

Finally, the strings have a format of XXXNNN (3 characters and 3 integers).

Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...

Phroggyy
  • Sort/traverse. O(n*log n). Or, with a space complexity of O(n), it can be done in O(n) time by having an array where each index corresponds to an element, and then just putting `1` in this array for each element found. – Eugene Sh. Oct 14 '16 at 13:47
  • Are the elements sorted or hashable? This will be linear regardless if neither of those is the case. – Carcigenicate Oct 14 '16 at 13:47
  • If the data isn't sorted, you'll have to search through it from start to end and compare each item, no way around it. Depending on how much you are going to use the data, it may or may not be worth sorting it first. – Lundin Oct 14 '16 at 13:47
  • @Lundin it will only be done once. It's a single check. Which makes me think sorting first, as suggested by Eugene, won't be very efficient – Phroggyy Oct 14 '16 at 13:49
  • @Carcigenicate How will it be linear if not sorted? If we assume O(1) memory. If it is O(n) memory, it can be linear as outlined. – Eugene Sh. Oct 14 '16 at 13:50
  • You don't need to sort to find duplicates. You could insert each string in a hash table and check for duplicates on collisions. Expected complexity for hash lookup and insertion is O(1) and therefore you can have a duplicate-detection algorithm that runs in expected O(N). In fact, I'd be very surprised if such an approach is not featured in some other question. – vmg Oct 14 '16 at 13:51
  • @vmg But the worst case scenario would be quadratic, if I am not mistaken... – Eugene Sh. Oct 14 '16 at 13:52
  • Like I said, a very quick search already yields [this](http://stackoverflow.com/questions/16589163/basic-hashtable-algorithm-removing-duplicates) question, with essentially the same idea. – vmg Oct 14 '16 at 13:52
  • @EugeneSh. If the collection isn't sorted or hashable, in the worst case scenario, you'll need to search the entire collection, which is O(n). It could only be O(1) using an identity lookup on a hashtable. – Carcigenicate Oct 14 '16 at 13:52
  • @Carcigenicate How will you search an unsorted array for duplicates in O(n) ??? – Eugene Sh. Oct 14 '16 at 13:53
  • @EugeneSh. By going over the entire list once, and maintaining a set of seen values. Without any order in the list, this is a linear search. – Carcigenicate Oct 14 '16 at 13:55
  • @Carcigenicate OK, so that's what I said to be O(n) space complexity in the first comment. – Eugene Sh. Oct 14 '16 at 13:55
  • @EugeneSh. Ya, which is why I'm confused, as I said it was linear from the start, then you appeared to say "this isn't linear, it's linear". – Carcigenicate Oct 14 '16 at 13:58
  • I just added another detail I just remembered. The strings have a specific format. – Phroggyy Oct 14 '16 at 14:00
  • Ha. In this case you can do a linear bucket/radix sort and end up with O(n) in any case. And yeah, the space complexity for the linear approach will be limited by the number of possible combinations, so theoretically it is O(1) space. So you are getting O(n) with any approach. – Eugene Sh. Oct 14 '16 at 14:01
  • You can, in a single pass, assign to each string an integer value between 0 and 255 (a kind of signature, like a checksum mod 256), create an array of 256 lists of integers, and append to each list the line numbers having the same signature. This helps group which lines may be identical. – Jean-Claude Colette Oct 14 '16 at 14:02
  • I added some further detail now as well, as this is not purely theoretical; I have to take into consideration the speed of e.g. string comparisons. – Phroggyy Oct 14 '16 at 14:23
  • Do you mean 3 _digits_ `NNN`? Are they base 10? – Useless Oct 14 '16 at 14:27
  • " I have a file containing several million entries" --> this is `N` entries. Is `N` known at the beginning of the search or not until the end? – chux - Reinstate Monica Oct 14 '16 at 14:48
  • Yes, sorry, 3 digits, base 10. @chux 6 million, known – Phroggyy Oct 14 '16 at 14:49
  • Hmmmm if 26 different characters and 10 different digits make 26*26*26*10*10*10 combinations, why not use a 17,576,000 bit table (2,197,000 bytes)? – chux - Reinstate Monica Oct 14 '16 at 14:56
  • What can be the `XXX` characters? If only say uppercase letters, you can have a boolean array of 26^3 * 10^3 elements, which can be stored in just over 2Mb. Then, mark each entry you see in the array, and if the mark is already present, you found your first duplicate. – Gassa Oct 14 '16 at 14:56
  • @Gassa uppercase only, yes – Phroggyy Oct 14 '16 at 14:57
  • @Phroggyy "I just added another detail I just remembered. The strings have a specific format" translates to "I just defined a much simpler sub-problem". Please avoid missing out on important details like that in future questions. Also, someone should really clean up this comment chain, as 'comments are not for extended discussion'. – vmg Oct 14 '16 at 16:54
  • Yep @vmg, very sorry about that, didn't realise it until then – Phroggyy Oct 14 '16 at 16:55
  • By the pigeonhole principle, you are guaranteed a collision if there are more entries than keys. – Emanuel Landeholm Oct 18 '16 at 09:12
  • OP are you familiar with bloom filters? "A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: false positives are possible [but false negatives are not]." http://billmill.org/bloomfilter-tutorial/ – Colonel Panic Oct 19 '16 at 14:32

6 Answers

7

Finally, the strings have a format of XXXNNN (3 characters and 3 integers).

Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).

If X ∈ {A..Z} and N ∈ {0..9}, that gives 26³ × 10³ = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.


Here you go: a python script to generate all possible 17 million keys:

import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print "%s%03d" % (''.join(prefix), numeric)   

and a simple C bitset filter:

#include <limits.h>
/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}
/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);

    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;

    return byte & mask;
}

which for your purposes you'd use like so:

#include <stdlib.h>
/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}
/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha   = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}

#include <stdio.h>
int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */
    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}

This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:

useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt

real    0m0.474s
user    0m0.436s
sys 0m0.036s

useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000

real    0m0.467s
user    0m0.452s
sys 0m0.016s

Admittedly, the input file is cached in memory for these runs.
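
If the file-input side ever becomes the bottleneck, one option is to read the file in large blocks rather than line by line. A hedged sketch (not part of the original answer): it assumes every record is exactly 7 bytes, i.e. XXXNNN plus a trailing newline, and reuses the testAndSetMyKey helper from above; the function name is illustrative.

#include <stdio.h>

/* Block-wise scan: reads 4096 fixed-size records (7 bytes each) per fread call.
   Returns 1 on the first collision, 0 if no duplicate was seen. */
int scanFileBlocks(FILE *in, unsigned char *filter) {
    char buf[7 * 4096];
    size_t got;
    while ((got = fread(buf, 7, 4096, in)) > 0) {
        for (size_t i = 0; i < got; i++) {
            if (testAndSetMyKey(filter, buf + 7 * i)) {
                printf("collision: %.6s\n", buf + 7 * i);
                return 1;
            }
        }
    }
    return 0;
}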

Useless
  • Nice linear-time solution to the problem, using a 2 MB table to avoid complex data structures or algorithms. The performance looks like O(n), since for a given input the Bloom filter needs to be checked only once per element with a constant-time lookup. – Michael Shopsin Oct 19 '16 at 15:44
  • Yep, absolutely. I charted a quick experiment over 1..7 million input rows, to check nothing untoward was happening, and up to measurement jitter it was straight as an arrow. – Useless Oct 19 '16 at 16:21
4

The reasonable answer is to pick the algorithm with the smallest complexity. I encourage you to use a hash table to keep track of inserted elements; the final algorithm complexity is O(n), because lookup and insertion in a hash table are O(1) on average. In your case, I suggest running the check while reading the file.

using System.Collections;

public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;

        hashTable.Add(input, string.Empty);
    }
    return false;
}
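
For completeness, a rough C counterpart of the same idea, using the POSIX hcreate/hsearch hash table. This is a hedged sketch only: the function name and table sizing are illustrative, and it assumes the input strings remain valid while the table is in use.

#include <search.h>
#include <stddef.h>

/* Returns 1 if a duplicate string is found among inputs[0..n-1], 0 otherwise. */
int there_are_duplicates(char **inputs, size_t n) {
    ENTRY e;
    hcreate(2 * n);                       /* oversize the table to limit collisions */
    for (size_t i = 0; i < n; i++) {
        e.key = inputs[i];
        e.data = NULL;
        if (hsearch(e, FIND) != NULL) {   /* key already inserted: duplicate */
            hdestroy();
            return 1;
        }
        hsearch(e, ENTER);
    }
    hdestroy();
    return 0;
}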
clcastro87
  • Is `hashTable.Add()` linear time? I suspect it would need to re-size itself from time to time if the number of item was not known prior to making the hash table. – chux - Reinstate Monica Oct 14 '16 at 14:44
  • It's _amortized_ average-case constant time. Some operations will be linear time, but so long as those happen with inverse-linear frequency, it amortizes out. And, some hash/key combinations will cause lots of collisions and degrade from constant to linear time again, but on average we hope that doesn't happen too often. – Useless Oct 14 '16 at 14:58
  • Agree about amortized average-case constant time, etc. Note: [OP later commented](http://stackoverflow.com/questions/40044744/fastest-algorithm-to-figure-out-if-an-array-has-at-least-one-duplicate/40046530#comment67368018_40044744) the file size was known at the beginning. – chux - Reinstate Monica Oct 14 '16 at 22:00
3

A fast but memory-inefficient solution would use

// Entries are XXXNNN, treated here as a 6-digit base-36 number
char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 };  // or calloc() this
char buffer[100];

while (fgets(buffer, sizeof buffer, istream)) {  // istream: the already-opened input FILE *
  unsigned long index = strtoul(buffer, NULL, 36);
  if (found[index]++) {
    Dupe_found();
    break;
  }
}

The trouble with the post is that it wants the "fastest algorithm", but does not detail memory concerns and their relative importance to speed. So speed must be king and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.

chux - Reinstate Monica
2

Depending on how many different values there can be, you have some options:

  • Sort the whole array and then look for adjacent repeating elements (sketched below). Complexity is O(n log n), but it can be done in place, so memory will be O(1).
  • Build a set of all elements. Depending on the chosen set implementation this can be O(n) (hash set) or O(n log n) (binary tree), but it will cost you some extra memory.
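
A minimal sketch of the first option in C, assuming the entries have already been read into an array of fixed-size, NUL-terminated strings; the names and sizes are illustrative, not part of the answer.

#include <stdlib.h>
#include <string.h>

#define ENTRY_LEN 7                                 /* "XXXNNN" + terminating NUL */

static int cmp_entry(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

/* Returns 1 if entries[] (n fixed-size strings) contains a duplicate, 0 otherwise. */
int has_duplicate_sorted(char entries[][ENTRY_LEN], size_t n) {
    qsort(entries, n, ENTRY_LEN, cmp_entry);        /* O(n log n), in place */
    for (size_t i = 1; i < n; i++)
        if (strcmp(entries[i - 1], entries[i]) == 0)  /* duplicates end up adjacent */
            return 1;
    return 0;
}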
Hauleth
2

The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).

The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes its piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.

However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N does all integers from "(min - max) * N / CPUs" to "(min - max) * (N+1) / CPUs"). This means that all CPUs read from the entire array, but each CPU only modifies its own private piece of the bitmap. This avoids some performance problems involved with cache coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.

The next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can/would be done using SIMD, which would require that the array is in "structure of arrays" format (and not the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string and pack more into each cache line.
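
To make the range-splitting idea concrete, here is a hedged sketch (my own illustration, not Brendan's code, and without the SIMD part). It assumes the XXXNNN strings have already been converted to integer keys; each thread scans the whole key array but only marks keys falling in its own byte-aligned slice of the bitmap, so no two threads ever write the same byte and no atomic bit operations are needed. All names and the thread count are illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NTHREADS 4
#define KEYSPACE (26 * 26 * 26 * 1000)        /* 17,576,000 possible XXXNNN keys */
#define BITMAP_BYTES ((KEYSPACE + 7) / 8)

static const int *keys;                       /* shared, read-only key array */
static size_t nkeys;
static unsigned char bitmap[BITMAP_BYTES];
static atomic_int dup_found;

struct range { int lo, hi; };                 /* half-open key range for one thread */

static void *worker(void *arg) {
    struct range r = *(struct range *)arg;
    for (size_t i = 0; i < nkeys && !atomic_load(&dup_found); i++) {
        int k = keys[i];
        if (k < r.lo || k >= r.hi)            /* not this thread's slice: skip */
            continue;
        unsigned char mask = 1u << (k % 8);
        if (bitmap[k / 8] & mask) {           /* bit already set: duplicate */
            atomic_store(&dup_found, 1);
            break;
        }
        bitmap[k / 8] |= mask;                /* only this thread touches this byte */
    }
    return NULL;
}

/* Returns 1 if a duplicate key exists in k[0..n-1], 0 otherwise. */
int has_duplicate_parallel(const int *k, size_t n) {
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    keys = k;
    nkeys = n;
    for (int t = 0; t < NTHREADS; t++) {
        /* byte-aligned boundaries so threads never share a bitmap byte */
        r[t].lo = 8 * (int)((long long)BITMAP_BYTES * t / NTHREADS);
        r[t].hi = 8 * (int)((long long)BITMAP_BYTES * (t + 1) / NTHREADS);
        pthread_create(&tid[t], NULL, worker, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return atomic_load(&dup_found);
}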

Brendan
  • I'm not sure I see how this would work with multiple CPUs... Firstly, I'm wasting cycles just splitting it up and starting processes. Now, you then state CPU N doing all integers in a given range. However, would that not require sorted input data? – Phroggyy Oct 14 '16 at 21:36
  • @Phroggyy: Starting threads is a "once only" cost. For huge arrays that cost is insignificant (dwarfed by the cost of processing a huge number of entries). For tiny arrays the cost of starting threads would be significant, but you wouldn't care about performance for tiny arrays in the first place. If you're expecting to have to handle everything from tiny arrays to huge arrays you can adjust the number of threads (e.g. 1 thread only if array is tiny, 2 threads if array is medium sized, .... up to the number of CPUs you have). – Brendan Oct 15 '16 at 11:20
  • Right @Brendan but is 6M really that huge? I have my current program running in 0.36s, and the slowest line of code is `filter[byteIndex] = byte | mask`, which is why I'm wondering if a second, or even third, thread is worth the cost – Phroggyy Oct 15 '16 at 11:23
  • @Phroggyy: It doesn't need the input to be sorted - if a CPU/thread is only handling integers within a certain range then it'd read from the array sequentially (a nice "cache friendly"/pre-fetchable access pattern) and simply skip over any integers that are outside that range. – Brendan Oct 15 '16 at 11:23
  • Right gotcha, so each process would be dealing with its own subset – Phroggyy Oct 15 '16 at 11:25
  • @Phroggyy: As a random estimate (based on many likely dodgy assumptions), spawning a thread probably costs around 1000 cycles, while 0.36 seconds is probably about 1000000000 cycles. For 4 CPUs you might be able to do it 0.10 seconds instead (not quite, but almost, 4 times faster). – Brendan Oct 15 '16 at 11:27
  • Right @Brendan, I see what you're saying, it seems like a great solution. However, how is file read speed affected by this? How would I process the file? Also, in my current implementation, file read is more than half the execution time, so I could never cut down more than that – Phroggyy Oct 15 '16 at 11:33
  • you were definitely right. Even spawning a _process_ rather than a _thread_ cut execution time by 40% – Phroggyy Oct 15 '16 at 12:42
2

Since you have several million entries, I think the best algorithm would be counting sort. Counting sort does exactly what you asked: it sorts an array by counting how many times every element occurs. So you could write a function that does a counting sort on the array:

#include <stdlib.h>

/* Returns 1 as soon as a duplicate is found, 0 otherwise. */
int counting_sort(const int a[], int n, int max)
{
    int *count = calloc(max + 1, sizeof *count);  /* a VLA cannot be initialized with = {0} */
    int i, dup = 0;

    for (i = 0; i < n; ++i) {
        if (++count[a[i]] >= 2) {
            dup = 1;
            break;
        }
    }
    free(count);
    return dup;
}

Where you should first find the max element (in O(n)). The asymptotic time complexity of counting sort is O(max(n, M)), where M is the maximum value found in the array. Because you have several million entries, if M is also on the order of a few million this runs in O(n) (the counting pass may stop earlier, but finding M already costs O(n)). If you also know that M cannot be greater than a few million, then you can be sure this gives O(n) and not just O(max(n, M)).

You can see a counting sort visualization to understand it better here: https://www.cs.usfca.edu/~galles/visualization/CountingSort.html

Note that the above function does not implement counting sort exactly: we stop as soon as we find a duplicate, which is even more efficient, since you only want to know if there is a duplicate.
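
As a hedged usage note (my own addition, not from the answer): with XXXNNN keys mapped to integers as in the accepted answer, the maximum possible key is known in advance, so the O(n) max-scan can be skipped entirely. The wrapper name is illustrative.

/* The largest possible XXXNNN key when mapped as in the accepted answer. */
enum { MAX_KEY = 26 * 26 * 26 * 1000 - 1 };   /* 17,575,999 */

/* keys[] holds the n integer keys parsed from the file. */
int file_has_duplicate(const int keys[], int n) {
    return counting_sort(keys, n, MAX_KEY);
}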

coder