
I have 15,000,000 `std::vector`s of 6 integers each.

Those 15M vectors contain duplicates. Duplicate example:

(4,3,2,0,4,23)
(4,3,2,0,4,23)

I need to obtain a list of the unique sequences with their associated counts. (A sequence that is present only once would have a count of 1.)

Is there an algorithm in the C++ standard library (C++11 is fine) that does this in one shot?

Windows, 4 GB RAM, 30+ GB HDD

Shafik Yaghmour
Lazik
  • When you say duplicates, do you mean that they contain the same integers in the same positions, or just the same integers period? – Jessie Dec 16 '13 at 00:53
  • @Lazik - You are telling a fib – Ed Heal Dec 16 '13 at 00:54
  • what is stopping you from using an unordered_map? If you need less memory, you can always use a map. – leorex Dec 16 '13 at 00:55
  • Define duplicate precisely. Does the whole vector have to contain the same values in the same order in order to be a duplicate? –  Dec 16 '13 at 00:55
  • Yes order matters (1,1,1,1,1,2) is not a duplicate of (2,1,1,1,1,1) – Lazik Dec 16 '13 at 01:00
  • possible duplicate of [what are the fast algorithms to find duplicate elements in a collection and group them?](http://stackoverflow.com/questions/1332527/what-are-the-fast-algorithms-to-find-duplicate-elements-in-a-collection-and-grou) –  Dec 16 '13 at 01:09
  • I flagged as a duplicate of another question, whose content is very detailed. What must have thrown you off your search for duplicates was naming this a "frequency counter". –  Dec 16 '13 at 01:10
  • 1
    If your vectors are of size 6 each, the constant size `std::array` is definitely better regarding your memory footprint, since it stores the elements in place and doesn't use dynamic memory allocation. – leemes Dec 16 '13 at 01:16
  • Thanks for pointing it out, I will def. switch to std::array. I'm new to the std, how would I go about dynamic allocation of an array of array? – Lazik Dec 16 '13 at 01:34
  • 1
    I'd recommend `std::vector` for the *outer* data structure, only `std::array` for the *inner* ones, since they are tiny and equally sized. `std::vector` does the allocation for you, no need to manually allocate anything. – leemes Dec 16 '13 at 01:42
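The layout suggested in these comments could be sketched as follows (`make_rows` and the sample values are illustrative, not from the question):

```cpp
#include <array>
#include <vector>

// Fixed-size inner type: the six ints are stored in place,
// no per-row heap allocation.
typedef std::array<int, 6> Row;

std::vector<Row> make_rows() {
    std::vector<Row> rows;         // the outer vector manages all allocation
    // rows.reserve(15000000);     // reserving up front avoids reallocations
    rows.push_back(Row{{4, 3, 2, 0, 4, 23}});
    rows.push_back(Row{{4, 3, 2, 0, 4, 23}});  // duplicate of the first row
    return rows;
}
```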

3 Answers


There is no single algorithm in the standard library which does exactly this, but it's very easy with one loop and the proper data structure.

For this you want to use std::unordered_map, which is typically implemented as a hash map. It has expected constant time per access (insert and look-up), making it the first choice for huge data sets.

The following access-and-increment trick will automatically insert a new entry in the counter map if it's not yet there (value-initialized to zero); it then increments the count in place.

typedef std::vector<int> VectorType;        // Please consider std::array<int,6>!

// Note: std::unordered_map needs a hash for the key type, and the standard
// library provides no std::hash for std::vector<int>, so you must supply a
// custom hasher as the third template argument (see the comments below).
std::unordered_map<VectorType, int> counters;

for (const VectorType& vec : vectors) {     // take by reference to avoid copies
    counters[vec]++;
}

For further processing, you most probably want to sort the entries by the number of occurrences. For this, either write them out into a vector of pairs (pairing each number vector with its occurrence count) and sort that, or into an (ordered) std::multimap with key and value swapped, so it's automatically ordered by the count (counts can repeat, so a plain std::map would drop entries).
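The counting-then-sorting step could be sketched like this. It uses `std::array<int,6>` keys and a hand-written `RowHash` functor (both names are illustrative), since the standard library provides no `std::hash` specialization for `std::vector<int>` or `std::array`:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::array<int, 6> Row;

// Hand-written hasher: the standard library has no std::hash for Row.
struct RowHash {
    std::size_t operator()(const Row& r) const {
        std::size_t h = 0;
        for (int x : r)
            h = h * 31 + std::hash<int>()(x);  // simple polynomial combine
        return h;
    }
};

// Count each distinct row, then return (count, row) pairs, most frequent first.
std::vector<std::pair<int, Row>> count_and_sort(const std::vector<Row>& rows) {
    std::unordered_map<Row, int, RowHash> counters;
    for (const Row& r : rows)
        ++counters[r];

    std::vector<std::pair<int, Row>> byCount;
    byCount.reserve(counters.size());
    for (const std::pair<const Row, int>& kv : counters)
        byCount.push_back(std::make_pair(kv.second, kv.first));
    std::sort(byCount.begin(), byCount.end(),
              [](const std::pair<int, Row>& a, const std::pair<int, Row>& b) {
                  return a.first > b.first;    // descending by count
              });
    return byCount;
}
```

Sorting only the `m` unique entries costs O(m log m) rather than O(n log n) over all 15M rows, which is the point leemes makes in the comments below.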

In order to reduce the memory footprint of this solution, try this:

If you don't need to get the keys back out of this hash map, you can use a hash map which doesn't store the keys but only their hashes. For this, use std::size_t as the key type, an identity (pass-through) function as the internal hash, and call the hash function std::hash<VectorType> manually on each access. (Note that std::identity is not part of standard C++11; a one-line pass-through functor has to be written by hand.)

struct IdentityHash {                       // pass-through: keys are already hashes
    std::size_t operator()(std::size_t v) const { return v; }
};

std::unordered_map<std::size_t, int, IdentityHash> counters;
std::hash<VectorType> hashFunc;             // still needs a custom helper for VectorType

for (const VectorType& vec : vectors) {
    counters[hashFunc(vec)]++;
}

This reduces memory, but requires additional effort to interpret the results: you have to loop over the original data structure a second time in order to find the original vectors (looking them up in your hash map by hashing them again). Be aware also that two distinct vectors can collide on the same hash value, in which case this variant silently merges their counts.

leemes
  • +1 for a sound approach to the question asked. If any sorting happens to be wanted by the `VectorType` values, directly using a `map` - instead of `unordered_map` - would likely be better. – Tony Delroy Dec 16 '13 at 01:04
  • 1
    @TonyD If performance really matters (it seems in this case), I'd only use ordered `map` *after* finding unique values. Otherwise you have `O(n log n)` where `n` is the total count, but if you only sort the resulting list of unique values, you have `O(m log m)` where `m` is the number of uniques. – leemes Dec 16 '13 at 01:06
  • @Potatoswatter Thanks for pointing this out! Never saw this data structure. It seems perfect for counting occurrences. – leemes Dec 16 '13 at 01:13
  • @leemes: Not clear to me that performance is important - 15m vectors can be handled in a fraction of a second - but depends how often it's done. "Otherwise you have O(n log n)..." - given n is known, we can do better than talk in big O notation - we're talking about around 14 comparisons per `map` insertion with comparison very cheap, versus having to hash the 6-element vectors then compare the hash. By the time you factor in the memory allocations for the additional copy to an ordered map, directly populating the map looks good, but obviously you'd want to benchmark. – Tony Delroy Dec 16 '13 at 01:31
  • Separately, if performance did matter, or memory use did, you'd probably be dealing in indices into the original vector or references to its elements rather than copying it by value. Bit more code though. – Tony Delroy Dec 16 '13 at 01:33
  • @Potatoswatter: doesn't `unordered_multiset` maintain multiple copies of the duplicate values? May not be a problem, but worth noting that that would be less space and time efficient that this answer's posted code if the number of duplicated exceeded 1 in 6 (sizeof a counter versus sizeof extra vectors, assuming all numbers are 32 bit). – Tony Delroy Dec 16 '13 at 01:42
  • @TonyD Yes, I've been trying to delete my previous comment for the last hour but my connection was flakey :( – Potatoswatter Dec 16 '13 at 02:05
  • I removed this in my answer and instead expanded the solution which maintains a hash map without the full size key but the hash values as the key. This should be smaller but still has more overhead than your answer, @Potatoswatter. – leemes Dec 16 '13 at 11:16
  • `unordered_map counters;` — a program with only the includes, typedef and this line fails to compile. I am using mingw g++ 4.8. I tried compiling with various options: -std=gnu++0x. What am I doing wrong? Does not compile in MS studio 2010 either. – Lazik Dec 17 '13 at 02:28
  • Then you don't have the typedef as above. Either `typedef std::vector<int> VectorType` or `typedef std::array<int,6> VectorType` but not mixed. – leemes Dec 17 '13 at 15:01
  • @leemes Have a look, this is the code. http://pastebin.com/dBtz6EYT Looks like it is unable to hash the vector type. – Lazik Dec 17 '13 at 19:46
  • 1
    Oh you're right. Try this: http://ideone.com/OlsjXR Note there is a helper class to hash the values. Use this helper class in the type of the unordered map has the hash functor (what I have underlined). – leemes Dec 17 '13 at 20:02

Yes: first std::sort the list (std::vector compares lexicographically, with the first element the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again, but with an inverted comparator, to find the first non-duplicate.

Alternately, you could use std::unique with a custom comparator that flags when a duplicate is found, and maintains a count through the successive calls. This also gives you a deduplicated list.

The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates. You don't have to copy the entire original dataset or add a seldom-used field for dup-count.
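A sketch of the sort-then-scan idea described above (`count_sorted` is an illustrative name; `std::adjacent_find` with `std::not_equal_to` plays the role of the inverted comparator, locating the end of each run of equal rows):

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

typedef std::vector<int> Row;   // operator< compares lexicographically

// Sort the rows, then record (row, count) for each run of equal rows.
std::vector<std::pair<Row, int>> count_sorted(std::vector<Row> rows) {
    std::sort(rows.begin(), rows.end());
    std::vector<std::pair<Row, int>> result;
    std::vector<Row>::iterator it = rows.begin();
    while (it != rows.end()) {
        // adjacent_find with the inverted comparator returns the last element
        // of the current run (the first position whose successor differs),
        // or rows.end() if the run extends to the end.
        std::vector<Row>::iterator runEnd =
            std::adjacent_find(it, rows.end(), std::not_equal_to<Row>());
        if (runEnd != rows.end())
            ++runEnd;                       // step past the last equal element
        result.push_back(std::make_pair(*it, static_cast<int>(runEnd - it)));
        it = runEnd;
    }
    return result;
}
```

Note this sorts a copy here for illustration; on the real 15M-row data set you would sort in place to stay within the 4 GB budget.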

Potatoswatter

You could convert each vector to a string, element by element, like this: "4,3,2,0,4,23". Then add each string to a new string vector, checking for existence first with the find() function.

If you need the original vectors back, convert the string vector to integer sequence vectors again. If you don't, simply skip duplicated elements while building the string vector.

karakale
  • 1
    What's the purpose of the string? You can as easily find and compare vectors of integers in data structures since there are appropriate operator overloads for `std::vector`. No need for a human readable representation within an algorithm like this. – leemes Dec 16 '13 at 01:39