0

I have a data structure question. I have a collection of strings which grows throughout the lifetime of a process. I want to be able to pass references to these strings around the program with varying durations. I don't want to add duplicates to the collection, so when I pass one in I want to be returned a reference to the existing entry, thus:

const std::string& add_new_entry(const std::string&)
{
    // Check if string exists
    // Return reference if it does
    // Otherwise add to collection
    // Return reference to it
}

The most naive implementation would be a list of strings and a call to std::find every time, but I can't help feeling that's deeply suboptimal, particularly since I'm dealing with upwards of 50,000 strings. I've created an extending array container so I can add elements arbitrarily without forcing a resize and move, and I'm indexing them with a std::set of std::string* ordered alphabetically using a dereferencing comparison predicate: can anyone do any better? 15 string comparisons seems like a lot.

Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
hatcat
  • 1,876
  • 1
  • 18
  • 37
  • *"I can't help feeling that's deeply suboptimal"* - sometimes you might end up really surprised by the speed of algorithms that seem to be inefficient at first glance. Maybe it's not your case, but I wanted to post a disclaimer about **avoiding premature optimization** anyway :) – LihO Feb 23 '13 at 13:49
  • Yes, you're doing *(max)* 15 string comparisons, but you won't reach this number that often, and many of these could just compare one or two characters. – Bernhard Barker Feb 23 '13 at 13:50
  • 2
    Why use an array at all, why not just `std::set`? – Bernhard Barker Feb 23 '13 at 13:58
  • 1
    Maybe a [trie](http://en.wikipedia.org/wiki/Trie) is what you want, but I don't know of standard implementations in C++. – Nobody moving away from SE Feb 23 '13 at 14:01
  • I agree with Dukeling, having an extra Array seems unnecessary because std::set doesn't invalidate iterators. Same goes for hash_set (which can be worth trying out). Using a set isn't premature optimization IMHO, it's smart because 1. It has the right semantics. 2. It's rather effecient. – Just another metaprogrammer Feb 23 '13 at 14:25
  • I remember reading about a tree where the more searched for nodes are pushed towards the top of the tree, thus resulting in better overall performance, but I can't remember the name of it now. Also not sure whether it worked for sorted trees. This obviously won't be in the STL. – Bernhard Barker Feb 23 '13 at 14:32
  • @Dukeling: you are thinking about an AVL tree, I think. – Matthieu M. Feb 23 '13 at 14:41
  • @MatthieuM. I believe AVL trees are only for balancing. It doesn't move more common nodes to the top. – Bernhard Barker Feb 23 '13 at 14:47

5 Answers5

2

To get rid of the O(log n) performance of set, you could use unordered_set which uses hashing (and is O(1)) (or hash_set which is essentially the same, but only supported by some compilers).

Given that you're doing (max) 15 string comparisons, you don't reach this maximum all the time, and many of these could just compare one or two characters, it's quite possible that generating the hash for unordered_set (and dealing with hash conflicts) would take longer than finding the value in the set.

Also, why not get rid of the array and just use std::set<std::string> instead? You can still return a reference all the same:

const string& add_new_entry(const string& str)
{
    set<string>::iterator iter = yourSet.find(str);
    if (iter == yourSet.end())
      return *yourSet.insert(str).first;
    return *iter;
}

Test.

Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
  • 1
    Note: actually, on a full balanced binary search tree you perform `log2(N)` comparisons ~50% of the times; so I would not say that 15 is not reached often... – Matthieu M. Feb 23 '13 at 14:43
  • insert NOT invalidating iterators was the piece of information I was misremembering. unordered_set DOES invalidate iterators on insert if it causes a rehashing, so I'll stick with set. Thanks. By the way, why did you edit out my sign off? – hatcat Feb 23 '13 at 17:29
  • @hatcat Why does invalidating iterators matter? Even if the iterator is invalidated, the reference to the string will still be valid. Yes I did. ["Thanks" is ... clutter and does not need to be in the post](http://meta.stackexchange.com/a/97137/206447). Also [this](http://meta.stackexchange.com/questions/2950/should-hi-thanks-taglines-and-salutations-be-removed-from-posts). – Bernhard Barker Feb 23 '13 at 18:35
  • @hatcat [It appears that `set` isn't invalidated on insert/erase](http://stackoverflow.com/a/6438087/1711796). But [you're perfectly right](http://ideone.com/arg89j), I don't know why it works like that though, maintaining referential integrity can't be difficult. – Bernhard Barker Feb 23 '13 at 20:59
1

Optimization is always possible, and occasionally very worthwhile, but for 50,000 entries I'm guessing it might not be necessary. Give that it is actually necessary, there's a few things you could try.

Firstly, if some entries are use more commonly than others, you could store them in a separate popular words dictionary, which you search first. To see if this is worthwhile, store a counter against each dictionary entry, incrementing it each time the entry is accessed, and have a look at these counters over a prolonged testing period.

Another thing worth having is a fixed size array of dictionaries, say 26^3 = 17576, where the first three letters of the entry are used to select the dictionary to search. This drops you down to o(1) for words of three letters or less, and drastically reduces your search time for the remaining entries.

SmacL
  • 22,555
  • 12
  • 95
  • 149
0

I would probably just use std::set, possibly wrapping its iterator in a small class checking for invalidation, so you can keep the iterators instead of pointers.

Don't prematurely optimize. Did you profile that code? Are you 100% sure that this is the bottleneck?

Bartek Banachewicz
  • 38,596
  • 7
  • 91
  • 135
  • Does the downvoter have the courage to reveal himself and state the reason? – Bartek Banachewicz Feb 23 '13 at 13:57
  • OP already mentions using an `std::set`, so this may be better suited as a comment. If you want to say things like `"wrapping its iterator in a small class checking for invalidation"`, you have to explain what that means (or maybe it's just me). – Bernhard Barker Feb 23 '13 at 14:01
0

Use a map. You won't have to search through your array/list.

Srikant Krishna
  • 881
  • 6
  • 11
-1

std::hash_set i suppose is the way to go

Viktor Sehr
  • 12,825
  • 5
  • 58
  • 90