
I have 4e7 std::strings, each about 3 to 30 characters long, with many duplicates. I'm putting them in a std::set.

Calling set::insert for each string becomes intractably slow well before it would finish; only about 1e7 of the 4e7 strings are unique. So instead I push_back each string into a vector, sort() and unique() that, and then move the strings into a set.

It's still slow, but at least it finishes: 4 seconds to accumulate the vector, 30 more for sort(), 3 more for unique().

The bottleneck is sort(). But I don't need the strings to be lexicographically sorted! I just need duplicate strings to be contiguous, for unique(). Their order is irrelevant. Is there a simpler, faster string comparison function for sort() that I could use instead of its default one?
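
For illustration, a minimal sketch of one such comparator (the name `cheap_less` is mine, and whether it actually beats the default needs measuring): sort() + unique() only need a strict weak ordering under which equal strings compare equivalent, so a comparator that checks length first settles most pairs of these 3-to-30-character strings with a single integer comparison.

auto cheap_less = [](const string& a, const string& b) {
  // Different lengths: decide immediately, without touching characters.
  if (a.size() != b.size()) return a.size() < b.size();
  return a < b;  // same length: fall back to character comparison
};
sort(v.begin(), v.end(), cheap_less);
v.erase(unique(v.begin(), v.end()), v.end());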

Or should I build the set faster by iterating over the vector with a hash table on the side, to skip duplicates? Or replace set with hash_set or unordered_set?

Edit: I'm building on Linux with g++ 4.8.4, with the only flags being -std=c++11 -O3.

Camille Goudeseune
  • I would definitely recommend you measure using `std::unordered_set`. – Some programmer dude Aug 08 '19 at 17:44
  • Relevant, possibly a duplicate: https://stackoverflow.com/q/14023106/1896169 – Justin Aug 08 '19 at 18:00
  • This blog (?) post has some discussion on the requested algorithm: https://lemire.me/blog/2008/05/01/i-am-seeking-an-efficient-algorithm-to-group-identical-values-in-an-array/ – Justin Aug 08 '19 at 18:02
  • @CamilleGoudeseune I see no mention of how you built the application you are testing. Are you running an optimized build, or a "debug", non-optimized build? – PaulMcKenzie Aug 08 '19 at 18:07
  • An idea assuming comparing the strings is part of the problem: you could precompute the hashes of the strings and sort based on that hash. Within each region of equal hashes, you'd have to sort again due to possible hash collisions, but it's still possible that this could be faster; a sketch follows these comments. – Justin Aug 08 '19 at 18:07
  • If you just need to make the strings unique, I'd recommend using `std::unordered_set` (or a third-party hash-set library, if you need additional performance; `std::unordered_set` is constrained in how performant it can be by the requirements of its interface). – Alecto Irene Perez Aug 08 '19 at 19:19
  • The right data structure is unordered_set, and the duplicate link is not a perfect match, but it's saying you may also have issues with unnecessary temporaries while building your set. Because std::set has the same complexity as sorting the vector, if that shows a speed difference then you have issues with temporaries, and those will likely still be there in the unordered_set solution. But it's difficult for us to help there as you did not post the code. – Kenny Ostrom Aug 08 '19 at 19:34
  • If you know your strings are never going to be more than 30 chars long, you might try replacing `std::string` with e.g. `char[32]` or something equivalent to that. Avoiding heap allocations could provide a significant speedup. – Jeremy Friesner Aug 08 '19 at 21:18
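
A rough sketch of Justin's hash-first idea from the comments (the names are mine and it is untested; `std::hash` collisions are handled by the tie-break on the string itself, and it needs the <functional> and <utility> headers):

vector<pair<size_t, string>> hv;
hv.reserve(v.size());
hash<string> h;
for (auto& s : v) hv.emplace_back(h(s), move(s));
sort(hv.begin(), hv.end());  // pair's operator< orders by hash, then by string
hv.erase(unique(hv.begin(), hv.end()), hv.end());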

1 Answer


@Someprogrammerdude, @J.AntonioPerez, @KennyOstrom: std::unordered_set is 6x faster. Post an answer and I'll accept it. (This offer may have been lost in all those comments.)

vector<string> v;
loop { v.push_back(my_string[i]); }  // pseudocode: append all 4e7 strings

Slow original:

sort(v.begin(), v.end());                      // the 30-second bottleneck
v.erase(unique(v.begin(), v.end()), v.end());  // drop now-adjacent duplicates
set<string> noduplicates = set<string>(
  make_move_iterator(v.begin()), make_move_iterator(v.end()));

6x faster than the preceding code block:

// No sort needed: the hash table discards duplicates as they are inserted.
unordered_set<string> noduplicates =
  unordered_set<string>(
  make_move_iterator(v.begin()), make_move_iterator(v.end()));
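
A possible further step (untested; the reserve size is just the question's "about 1e7 unique strings"): skip the intermediate vector and insert each string straight into the unordered_set, reserving buckets up front so the table never rehashes while it grows.

unordered_set<string> noduplicates;
noduplicates.reserve(10000000);  // ~1e7 expected unique strings
loop { noduplicates.insert(move(my_string[i])); }  // pseudocode, as above
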
Camille Goudeseune