
I have 4e7 std::strings, each about 3 to 30 characters long, with many duplicates. I'm putting them in a std::set.

Calling set::insert for each string becomes intractably slow well before it would finish; only about 1e7 of the 4e7 strings are unique. So instead I push_back each string into a vector, sort() and unique() that, and then move the strings into a set.

It's still slow, but at least it finishes: 4 seconds to accumulate the vector, 30 more for sort(), 3 more for unique().

The bottleneck is sort(). But I don't need the strings to be lexicographically sorted! I just need duplicate strings to be contiguous, for unique(). Their order is irrelevant. Is there a simpler, faster string comparison function for sort() that I could use instead of its default one?
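
For illustration, a minimal sketch of one such comparator (the name `cheap_less` is mine, and whether it actually beats the default needs measuring): sort() + unique() only need a strict weak ordering under which equal strings compare equivalent, so a comparator that checks length first settles most pairs of these 3-to-30-character strings with a single integer comparison.

auto cheap_less = [](const string& a, const string& b) {
  // Different lengths: decide immediately, without touching characters.
  if (a.size() != b.size()) return a.size() < b.size();
  return a < b;  // same length: fall back to character comparison
};
sort(v.begin(), v.end(), cheap_less);
v.erase(unique(v.begin(), v.end()), v.end());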

Or should I build the set faster by iterating over the vector with a hash table on the side, to skip duplicates? Or replace set with hash_set or unordered_set?

Edit: I'm building on Linux with g++ 4.8.4, with the only flags being -std=c++11 -O3.

Camille Goudeseune
  • I would definitely recommend you measure using `std::unordered_set`. – Some programmer dude Aug 08 '19 at 17:44
  • Relevant, possibly a duplicate: https://stackoverflow.com/q/14023106/1896169 – Justin Aug 08 '19 at 18:00
  • This blog (?) post has some discussion on the requested algorithm: https://lemire.me/blog/2008/05/01/i-am-seeking-an-efficient-algorithm-to-group-identical-values-in-an-array/ – Justin Aug 08 '19 at 18:02
  • @CamilleGoudeseune I see no mention of how you built the application you are testing. Are you running an optimized build, or a "debug", non-optimized build? – PaulMcKenzie Aug 08 '19 at 18:07
  • An idea assuming comparing the strings is part of the problem: you could precompute the hashes of the strings and sort based on that hash. Within each region of equal hashes, you'd have to sort again due to possible hash collisions, but it's still possible that this could be faster; a sketch follows these comments. – Justin Aug 08 '19 at 18:07
  • If you just need to make the strings unique, I'd recommend using `std::unordered_set` (or a third-party hash-set library, if you need additional performance; `std::unordered_set` is constrained in how performant it can be by the requirements of its interface). – Alecto Irene Perez Aug 08 '19 at 19:19
  • The right data structure is unordered_set, and the duplicate link is not a perfect match, but it's saying you may also have issues with unnecessary temporaries while building your set. Because std::set has the same complexity as sorting the vector, if that shows a speed difference then you have issues with temporaries, and those will likely still be there in the unordered_set solution. But it's difficult for us to help there as you did not post the code. – Kenny Ostrom Aug 08 '19 at 19:34
  • If you know your strings are never going to be more than 30 chars long, you might try replacing `std::string` with e.g. `char[32]` or something equivalent to that. Avoiding heap allocations could provide a significant speedup. – Jeremy Friesner Aug 08 '19 at 21:18
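
A rough sketch of Justin's hash-first idea from the comments (the names are mine and it is untested; `std::hash` collisions are handled by the tie-break on the string itself, and it needs the <functional> and <utility> headers):

vector<pair<size_t, string>> hv;
hv.reserve(v.size());
hash<string> h;
for (auto& s : v) hv.emplace_back(h(s), move(s));
sort(hv.begin(), hv.end());  // pair's operator< orders by hash, then by string
hv.erase(unique(hv.begin(), hv.end()), hv.end());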

1 Answer


@Someprogrammerdude, @J.AntonioPerez, @KennyOstrom: std::unordered_set is 6x faster. Post an answer and I'll accept it. (This offer may have been lost in all those comments.)

vector<string> v;
loop { v.push_back(my_string[i]); }  // pseudocode: append all 4e7 strings

Slow original:

sort(v.begin(), v.end());                      // the 30-second bottleneck
v.erase(unique(v.begin(), v.end()), v.end());  // drop now-adjacent duplicates
set<string> noduplicates = set<string>(
  make_move_iterator(v.begin()), make_move_iterator(v.end()));

6x faster than the preceding code block:

// No sort needed: the hash table discards duplicates as they are inserted.
unordered_set<string> noduplicates =
  unordered_set<string>(
  make_move_iterator(v.begin()), make_move_iterator(v.end()));
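
A possible further step (untested; the reserve size is just the question's "about 1e7 unique strings"): skip the intermediate vector and insert each string straight into the unordered_set, reserving buckets up front so the table never rehashes while it grows.

unordered_set<string> noduplicates;
noduplicates.reserve(10000000);  // ~1e7 expected unique strings
loop { noduplicates.insert(move(my_string[i])); }  // pseudocode, as above
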
Camille Goudeseune