31

My problem is not a usual one. Imagine a few billion strings, each usually less than 15 characters. In this list I need to find the number of unique elements.

First of all, what object should I use? Don't forget that every time I add a new element I have to check whether it already exists in the list. That is not a problem in the beginning, but after a few million words it can really slow the process down.

That's why I thought a Hashtable would be ideal for this task, because checking for an element is ideally only O(1). Unfortunately a single object in .NET can only be 2 GB.

The next step would be to implement a custom hashtable that contains a list of 2 GB hashtables.

I am wondering whether some of you know a better solution. (The computer has an extremely high specification.)
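
For reference, the straightforward in-memory version looks something like this sketch (the file name is just a stand-in for the real input; the whole problem is that this stops working once the set no longer fits in a single object):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class NaiveUniqueCount
    {
        static void Main()
        {
            // One string per line; HashSet ignores duplicates on Add.
            var seen = new HashSet<string>();
            foreach (string s in File.ReadLines("strings.txt"))
                seen.Add(s);
            Console.WriteLine(seen.Count);
        }
    }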

John Saunders
  • 160,644
  • 26
  • 247
  • 397
Andras Csehi
  • 4,305
  • 1
  • 28
  • 36
  • By "unique elements" do you mean characters or strings? Is a string a single word? – Ragepotato Jan 12 '10 at 22:18
  • 1
    Do you expect there to be many unique elements or is it likely that most strings are repeated? – JaakkoK Jan 12 '10 at 22:21
  • 19
    Fastest way to code: add everything to a SQL Server table and issue a query. – Mehrdad Afshari Jan 12 '10 at 22:25
  • Are characters in the strings limited to one byte (e.g. ANSI, ASCII), or Unicode, or ...? – ThinkJet Jan 12 '10 at 22:28
  • 1
    "I need to find out the number of the unique elements" - are you counting multiple occurrences of a same string, finding whether the string is in the set, or doing something else? – Mathias Jan 12 '10 at 22:29
  • Looks like I should have posted my comment as an answer. However, while this is applicable in practice most of the time, I believe it's not an acceptable answer to this question (as it's unlikely to be the fastest solution) and what if the OP wants to implement it in his own database engine? – Mehrdad Afshari Jan 12 '10 at 22:35
  • The OP doesn't specify what fastest means: fastest runtime following implementation, or fastest to get going from beginning of development to completion of count? If the latter, use the database. No need to worry about memory issues - the database vendor has already done it for you. – Dathan Jan 12 '10 at 22:44
  • Agree with the comment by @Mehrdad Afshari. @Andras, please describe your priorities more clearly (execution speed, memory usage, time for solution delivery, whether disk access is allowed, how frequently the data is modified, etc.) – ThinkJet Jan 12 '10 at 22:45
  • Maybe for such a high-performance and large-data application C# is not the best choice? It seems to me that a more memory-efficient language such as C++ might be better suited for the job. – Dirk Vollmar Jan 12 '10 at 23:23
  • Is it true that a single object in .NET can only be 2GB? Does that apply to 64-bit .NET as well? – Arve Jan 13 '10 at 01:32
  • Can someone who understands this question please edit it so I can understand it? Or email me and I'll edit it... – Norman Ramsey Jan 13 '10 at 05:00

13 Answers

29

I would skip the data-structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug? Just use a database. Databases are really good at answering queries like this.
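
For concreteness, a rough sketch of the counting side from C# (the connection string, table and column names are invented; it assumes the strings were already bulk-loaded, e.g. with SqlBulkCopy, into a single-column table). COUNT_BIG returns a bigint, so the result cannot overflow a 32-bit int:

    using System;
    using System.Data.SqlClient;

    class CountDistinct
    {
        static void Main()
        {
            // Placeholder connection string and schema.
            using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Strings;Integrated Security=true"))
            using (var cmd = new SqlCommand("SELECT COUNT_BIG(DISTINCT Value) FROM dbo.Strings", conn))
            {
                conn.Open();
                long uniqueCount = (long)cmd.ExecuteScalar();
                Console.WriteLine(uniqueCount);
            }
        }
    }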

D.Shawley
  • 58,213
  • 10
  • 98
  • 113
  • +1 For the simple answer. Databases are partly designed for this kind of task. No sense reinventing the wheel. – Jamie Keeling Jan 12 '10 at 22:40
  • 5
    That really depends on his application constraints, and makes an assumption that might well not be valid. – Eric J. Jan 12 '10 at 22:43
  • 2
    This is a programming question, not a query question. (Yeah, queries are programs, but let's eschew that.) Plus the OP labeled the question as C#. – Thomas Eding Jan 12 '10 at 23:05
  • This is indeed easiest to code, but the OP asked for fastest. A customised algorithm like a DAWG (see Lee's answer) will beat any database for the OP's specific needs in both space and performance. – SPWorley Jan 13 '10 at 00:23
  • 3
    A database engine like SQL Server is optimized for large amounts of data. Any in-memory algorithm runs the risk of taking too much time and causing too much paging, and/or thread contention. I do not think you should rule out a database as possibly being fastest in this case. – John Saunders Jan 13 '10 at 01:22
  • 3
    This is a very bad idea, and will take an eternity compared to a single iteration over the data using a trie to keep track of strings you've seen. My only regret is that I can only downvote this once. – Terry Mahaffey Jan 13 '10 at 04:01
  • 4
    (1) databases are optimized for set-based functions - existence, intersection, counting, etc. (2) C# has database access the last time that I checked (3) if the data set is larger than available/efficient memory size, then a custom data structure becomes *very* difficult - think about how you would page out portions of a trie to disk and still make it efficient (4) don't rule out the cost of loading the data if you need to do it more than once (5) try writing a trie that multiple threads can traverse and allow for modification – D.Shawley Jan 13 '10 at 13:38
  • 1
    @Terry: I don't see a requirement for iterating over the data set. Querying for existence does not necessarily imply iteration. Based on the fact that the OP is talking about data sets that exceed 2GB, any in-memory data structure is likely to cause enough VM cache misses and swapping to slow it to a crawl. The only effective option for non-memory solutions is a database IMHO. – D.Shawley Jan 16 '10 at 16:16
  • 2
    +1 Most practical answer. I swear nobody understands the word "fast". If you spend 6 hours creating/testing your own custom hash table, you are slower than molasses. Writing something to populate a database and then writing something to query it would take me roughly ten minutes (plus transfer times). Computer time is cheap. Programmer time is **expensive**. Think about it. – Josh Stodola Feb 04 '10 at 16:10
  • 2
    @Josh: most programmers do not like to hear the answer _well don't write a program to solve the problem, use something that exists_ – D.Shawley Feb 04 '10 at 18:35
  • 1
    @D.Shawley I disagree. I think most programmers are lazy, and therefore that phrase is music to the ears. Re-inventing the wheel and optimizing prematurely are both huge wastes of time. The sooner a programmer understands that and utilizes tools that the previous generations have perfected, the more productive they shall be. – Josh Stodola Feb 04 '10 at 20:14
23

I'd consider a Trie or a Directed Acyclic Word Graph (DAWG), which should be more space-efficient than a hash table. Testing a string for membership would be O(len), where len is the length of the input string, which is probably the same cost as computing a string hash.
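
A minimal trie sketch in C#; it counts a string only the first time its terminal node is marked. Children are held in a Dictionary for simplicity, so this version is not tuned for memory, which is the real attraction of a trie/DAWG here:

    using System;
    using System.Collections.Generic;

    class TrieCounter
    {
        // One node per prefix; children are keyed by the next character.
        sealed class Node
        {
            public Dictionary<char, Node> Children = new Dictionary<char, Node>();
            public bool IsWord;
        }

        readonly Node root = new Node();
        public long UniqueCount { get; private set; }

        // Adds s; the count goes up only the first time s is seen.
        public void Add(string s)
        {
            Node node = root;
            foreach (char c in s)
            {
                Node child;
                if (!node.Children.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Children.Add(c, child);
                }
                node = child;
            }
            if (!node.IsWord)
            {
                node.IsWord = true;
                UniqueCount++;
            }
        }

        static void Main()
        {
            var trie = new TrieCounter();
            foreach (string s in new[] { "cat", "car", "cat" })
                trie.Add(s);
            Console.WriteLine(trie.UniqueCount);   // prints 2
        }
    }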

Lee
  • 142,018
  • 20
  • 234
  • 287
7

This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.

If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.

This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
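
A compressed sketch of the radix-sort-then-scan approach (it assumes characters fit in a single byte, as one of the comments asked about; Unicode input would need a larger radix or a byte-level encoding):

    using System;
    using System.Linq;

    class RadixUnique
    {
        // Positions past the end of a string count as character 0;
        // real characters are assumed to lie in 1..255 (single-byte text).
        static int CharAt(string s, int i) { return i < s.Length ? s[i] : 0; }

        // LSD radix sort: one stable counting-sort pass per character
        // position, from the last position back to the first.
        static string[] RadixSort(string[] a, int maxLen)
        {
            const int R = 256;
            var aux = new string[a.Length];
            for (int pos = maxLen - 1; pos >= 0; pos--)
            {
                var count = new int[R + 1];
                foreach (string s in a) count[CharAt(s, pos) + 1]++;
                for (int r = 0; r < R; r++) count[r + 1] += count[r];
                foreach (string s in a) aux[count[CharAt(s, pos)]++] = s;
                var tmp = a; a = aux; aux = tmp;
            }
            return a;
        }

        static void Main()
        {
            var data = new[] { "cab", "cat", "car", "cat", "ca" };
            string[] sorted = RadixSort(data, data.Max(s => s.Length));

            // Adjacent equal strings are duplicates once the array is sorted.
            long unique = sorted.Length == 0 ? 0 : 1;
            for (int i = 1; i < sorted.Length; i++)
                if (sorted[i] != sorted[i - 1]) unique++;
            Console.WriteLine(unique);   // prints 4
        }
    }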

KirarinSnow
  • 1,062
  • 2
  • 9
  • 22
  • Better than suggesting a database, but still - sorting the data is overkill, and not required by the problem space. Any solution which does so is not optimal. trie trees, however, are designed to solve almost this exact problem. – Terry Mahaffey Jan 13 '10 at 04:04
  • @Terry Mahaffey: Comparison sort (mergesort for example) is not optimal. The constraints of the problem, however, allow for radix sort, which is optimal (asymptotically). The tokens being sorted are strings of bounded length and there is a bounded number of possible characters in each position. I do agree that tries are better (for space reasons), but not because radix sort is not optimal. – KirarinSnow Jan 13 '10 at 23:21
4

If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none come built into the Framework). Be sure to get one that is balanced, like a Red-Black Tree or an AVL Tree.

The advantage is that each object in the tree is relatively small (it only contains its value and links to its parent and two children), so you can have a whole slew of them.

Also, because it's sorted, retrieval and insertion times are both O(log n).
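
A stripped-down sketch of the idea; balancing is deliberately omitted, so a real version should add red-black or AVL rotations (or use a ready-made balanced tree such as .NET 4's SortedSet<T>, which arrived after this answer was written):

    using System;

    class BstCounter
    {
        // Minimal (unbalanced) binary search tree: a value plus two child links.
        sealed class Node
        {
            public readonly string Value;
            public Node Left, Right;
            public Node(string v) { Value = v; }
        }

        Node root;
        public long UniqueCount { get; private set; }

        // Inserts s if it is absent; does nothing if it is already present.
        public void Add(string s)
        {
            if (root == null) { root = new Node(s); UniqueCount++; return; }
            Node node = root;
            while (true)
            {
                int cmp = string.CompareOrdinal(s, node.Value);
                if (cmp == 0) return;   // duplicate
                if (cmp < 0)
                {
                    if (node.Left == null) { node.Left = new Node(s); UniqueCount++; return; }
                    node = node.Left;
                }
                else
                {
                    if (node.Right == null) { node.Right = new Node(s); UniqueCount++; return; }
                    node = node.Right;
                }
            }
        }

        static void Main()
        {
            var tree = new BstCounter();
            foreach (string s in new[] { "b", "a", "c", "a" })
                tree.Add(s);
            Console.WriteLine(tree.UniqueCount);   // prints 3
        }
    }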

Nick
  • 5,875
  • 1
  • 27
  • 38
3

Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or in some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sorting is popular for external sorts, and it needs only an amount of extra space equal to the input itself. Start by dividing the input into pieces that fit into memory, sort those, and then start merging.
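
A rough sketch of that pipeline (file names and the chunk size are placeholders; a production version would delete the temp files and use a heap for the merge once there are many runs):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class ExternalSortCount
    {
        // Phase 1: read fixed-size chunks, sort each in memory, write sorted runs.
        static List<string> WriteSortedRuns(string inputPath, int chunkSize)
        {
            var runs = new List<string>();
            var chunk = new List<string>(chunkSize);
            foreach (string line in File.ReadLines(inputPath))
            {
                chunk.Add(line);
                if (chunk.Count == chunkSize) runs.Add(FlushRun(chunk));
            }
            if (chunk.Count > 0) runs.Add(FlushRun(chunk));
            return runs;
        }

        static string FlushRun(List<string> chunk)
        {
            chunk.Sort(StringComparer.Ordinal);
            string path = Path.GetTempFileName();
            File.WriteAllLines(path, chunk);
            chunk.Clear();
            return path;
        }

        // Phase 2: merge the sorted runs, counting each distinct string once.
        static long MergeAndCount(List<string> runPaths)
        {
            var readers = runPaths.Select(p => new StreamReader(p)).ToArray();
            var heads = readers.Select(r => r.ReadLine()).ToArray();
            long unique = 0;
            string previous = null;

            while (true)
            {
                // Pick the smallest current head (linear scan; fine for a handful of runs).
                int min = -1;
                for (int i = 0; i < heads.Length; i++)
                    if (heads[i] != null &&
                        (min == -1 || string.CompareOrdinal(heads[i], heads[min]) < 0))
                        min = i;
                if (min == -1) break;

                if (previous == null || heads[min] != previous) { unique++; previous = heads[min]; }
                heads[min] = readers[min].ReadLine();
            }

            foreach (var r in readers) r.Dispose();
            return unique;
        }

        static void Main()
        {
            List<string> runs = WriteSortedRuns("strings.txt", 1000000);
            Console.WriteLine(MergeAndCount(runs));
        }
    }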

JaakkoK
  • 8,247
  • 2
  • 32
  • 50
2

With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high (.NET hash codes are 32-bit ints, yielding roughly 4 billion distinct hash values; if you have as few as 100 million unique strings, the risk of a hash collision may be unacceptably high). Statistics isn't my strongest point, but some Google research turns up that, for a perfectly distributed 32-bit hash, the probability that a newly hashed string collides with one of the N - 1 already hashed is (N - 1) / 2^32.

You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.

Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
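
A sketch of the bucketing part, using the built-in hash only to pick a partition while still storing the full strings, so collisions inside each set are resolved exactly (storing only a wide hash such as SHA-1, as described above, would save memory at the cost of a tiny collision risk). The input file name is a placeholder:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class PartitionedSets
    {
        static void Main()
        {
            const int Buckets = 256;
            var sets = new HashSet<string>[Buckets];
            for (int i = 0; i < Buckets; i++) sets[i] = new HashSet<string>();

            foreach (string s in File.ReadLines("strings.txt"))
            {
                // Mask off the sign bit so the modulo is never negative.
                int bucket = (s.GetHashCode() & 0x7FFFFFFF) % Buckets;
                sets[bucket].Add(s);   // HashSet ignores duplicates
            }

            long unique = 0;
            foreach (var set in sets) unique += set.Count;
            Console.WriteLine(unique);
        }
    }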

Community
  • 1
  • 1
Eric J.
  • 147,927
  • 63
  • 340
  • 553
1

Divide and conquer - partition the data by the first 2 letters (say)

dictionary of xx => dictionary of string => count
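
Roughly, as a sketch (the input file name is a placeholder; as the comments below point out, two-character prefixes can be badly skewed, so keying the outer dictionary on a few bits of a hash would spread the partitions more evenly):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class PrefixPartition
    {
        static void Main()
        {
            // Outer key: the first two characters; inner key: the full string.
            var partitions = new Dictionary<string, Dictionary<string, int>>();

            foreach (string s in File.ReadLines("strings.txt"))
            {
                string prefix = s.Length >= 2 ? s.Substring(0, 2) : s;
                Dictionary<string, int> inner;
                if (!partitions.TryGetValue(prefix, out inner))
                {
                    inner = new Dictionary<string, int>();
                    partitions.Add(prefix, inner);
                }
                int count;
                inner.TryGetValue(s, out count);   // count stays 0 if s is new
                inner[s] = count + 1;
            }

            long unique = 0;
            foreach (var inner in partitions.Values) unique += inner.Count;
            Console.WriteLine(unique);
        }
    }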

pm100
  • 48,078
  • 23
  • 82
  • 145
  • My inclination but I would make the first partition more effective. Don't key on two characters, key on say the first 16 bits of a hash of the string. – Loren Pechtel Jan 13 '10 at 04:43
  • To get a hash of the string you need to scan the entire string, so inspecting the first couple of characters may be faster (though it will probably be limited by bus speed, and since cache is loaded a line at a time, it may not be). – Pete Kirkham Jan 13 '10 at 22:52
1

I would use a database; any database would do.

It is probably the fastest approach, because modern databases are optimized for speed and memory usage.

You need only one column with an index, and then you can count the number of records.

si618
  • 16,580
  • 12
  • 67
  • 84
GvS
  • 52,015
  • 16
  • 101
  • 139
  • 4
    I doubt that a general purpose database can outperform a specialized, optimized algorithm in this case. General purpose DB's balance a number of competing needs (insert speed, update speed, query speed, memory vs. CPU). A specialized algorithm can be tuned according to the OP's needs. – Eric J. Jan 12 '10 at 23:22
  • But what's fastest: dumping the data into a database, or selecting (or inventing) and tuning a specialized algorithm? If you can keep everything in memory and don't hit the internal limits of Array, List<> or Dictionary<>, the implementation effort will be about the same and the code will probably perform faster. If you reach those limits, however.... – GvS Jan 13 '10 at 09:07
1

+1 for the SQL/DB solutions; they keep things simple and will allow you to focus on the real task at hand.

But just for academic purposes, I would like to add my 2 cents.

-1 for hashtables (I cannot vote down yet). Because they are implemented using buckets, the storage cost can be huge in many practical implementations. I also agree with Eric J. that the chance of collisions will undermine the time-efficiency advantages.

Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (which will be the case if you may also need to perform search-like operations on the set of strings in the future and you have ample memory available), tries can be a good choice.

Space will be a problem with radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.

Below is my solution for one-time duplicate counting, with limits on how much space can be used.

If we have enough storage available to hold a billion elements in memory, we can sort them in place with heap-sort in Θ(n log n) time and then simply traverse the collection once in O(n) time, doing this:

for (int i = 0; i < a.Length - 1; i++)   // a is the sorted array
    if (a[i] == a[i + 1])
        dupCount++;                      // unique count = a.Length - dupCount

If we do not have that much memory available, we can divide the input file on disk into smaller files (until each piece is small enough to hold in memory), sort each such small file using the above technique, and then merge them together. This requires many passes over the main input file.

I would keep away from quick-sort because the dataset is huge. If I could squeeze out some extra memory for the second case, I would rather use it to reduce the number of passes than spend it on merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).

Edit: SQL/DB solutions are good only when you need to store this data for a long duration.

Edward I
  • 93
  • 5
0

Have you tried a hash map (Dictionary in .NET)? Dictionary<string, byte> would only take up 5 bytes per entry on x86 (4 for the pointer into the string pool, 1 for the byte), which allows for about 400M elements. If there are many duplicates, they should be able to fit. Implementation-wise, it might be very slow (or not work at all), since you also need to store all those strings in memory.

If the strings are very similar, you could also write your own Trie implementation.

Otherwise, your best bet would be to sort the data in place on disk (after which counting unique elements is trivial), or to use a lower-level, more memory-tight language like C++.

BlueRaja - Danny Pflughoeft
  • 84,206
  • 33
  • 197
  • 283
0

A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.

Hans Passant
  • 922,412
  • 146
  • 1,693
  • 2,536
0

I agree with the other posters regarding a database solution, but further to that, a reasonably intelligent use of triggers, plus a potentially cute indexing scheme (i.e. a numerical representation of the strings), would be the fastest approach, IMHO.

Noon Silk
  • 54,084
  • 6
  • 88
  • 105
0

If what you need is a close approximation of the unique count, look at the HyperLogLog algorithm. It is used to get a close estimate of the cardinality of large datasets like the one you are referring to. Google BigQuery and Reddit use it for similar purposes. Many modern databases have implemented it. It is pretty fast and can work with minimal memory.
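
For the curious, a compressed sketch of the algorithm in C#; the hash function (64-bit FNV-1a) and the register count are illustrative choices rather than a tuned implementation:

    using System;

    // Simplified HyperLogLog: estimates the number of distinct strings with
    // m = 2^14 one-byte registers (16 KB of state, roughly 0.8% standard error).
    class HyperLogLog
    {
        const int B = 14;              // bits used to pick a register
        const int M = 1 << B;          // 16384 registers
        readonly byte[] registers = new byte[M];

        // 64-bit FNV-1a over the string's UTF-16 code units (placeholder hash).
        static ulong Hash(string s)
        {
            ulong h = 14695981039346656037UL;
            foreach (char c in s) { h ^= c; h *= 1099511628211UL; }
            return h;
        }

        public void Add(string s)
        {
            ulong h = Hash(s);
            int index = (int)(h >> (64 - B));   // top B bits pick the register
            ulong rest = h << B;                // the remaining bits
            int rank = 1;                       // position of the first 1-bit
            while (rest != 0 && (rest & 0x8000000000000000UL) == 0) { rank++; rest <<= 1; }
            if (rest == 0) rank = 64 - B + 1;
            if (registers[index] < rank) registers[index] = (byte)rank;
        }

        public double Estimate()
        {
            double sum = 0;
            int zeros = 0;
            foreach (byte r in registers)
            {
                sum += Math.Pow(2.0, -r);
                if (r == 0) zeros++;
            }
            double alpha = 0.7213 / (1.0 + 1.079 / M);
            double estimate = alpha * M * M / sum;
            // Small-range correction: fall back to linear counting.
            if (estimate <= 2.5 * M && zeros > 0)
                estimate = M * Math.Log((double)M / zeros);
            return estimate;
        }

        static void Main()
        {
            var hll = new HyperLogLog();
            for (int i = 0; i < 1000000; i++)
                hll.Add("item" + (i % 250000));   // 250,000 distinct values
            Console.WriteLine(hll.Estimate());    // prints an estimate near 250,000
        }
    }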

RKangel
  • 1
  • 1