1

What I need is an ordered, associative container with string keys, valued by a vector of numbers. Further, I need O(1) insertion time.

My description sounds abstract, I shall give you a scenario:

There is an online test. When a person takes this test his name is added into a database. People may take the test repeatedly if they like. All their scores will be recorded under their name (which is unique). For example, David, Tom, Alice come and take the exam a few times. The program should be able to print out the output in following format:

Alice 65 70 84
David 98 97 93
Tom   100 45 
...

As you can see their names should be printed out in lexicographical order. Whenever someone takes the test, his score is added to the database. Since many many people will come take the test, this must be O(1) time complexity. However printing out the database also happens frequently; say once every second. It would therefore be advantageous to not have to explicitly sort on each display.

What data structure can I use here? (STL is preferred.) I'm currently using an unordered_map since it gives me O(1) insertion, but it can't iterate the keys in lexicographical order.

phs
  • 10,687
  • 4
  • 58
  • 84
Arch1tect
  • 4,128
  • 10
  • 48
  • 69
  • If you often insert keys, then use a linked list, that inserts in O(1). If you often retrieve them, use an array, that has a lookup time of O(1). –  Mar 30 '13 at 06:12
  • 2
    IF "every one hour" is often for you, just go ahead and choose whatever you want. It won't make a difference. – Blindy Mar 30 '13 at 06:13
  • @H2CO3 Using a (sorted) linked list requires `O(n)` to find the correct position of insertion. – timrau Mar 30 '13 at 06:14
  • @Blindy I changed it to every second... Do you think I have to sort the strings? – Arch1tect Mar 30 '13 at 06:17
  • 1
    Using a `trie` will give you natural lexicographical order and insertion will be `O(word length)`. Not `O(1)` but if word length << number of records - it' probably good enough. – SomeWittyUsername Mar 30 '13 at 06:21
  • @timrau "that inserts in O(1)" does ***not*** imply it has an O(1) lookup time as well (if it had, linked list would be the only data structure we use). Don't twist my words. –  Mar 30 '13 at 06:38
  • 1
    @Arch1tect: I very strongly suggest that you use a `std::map`. Although it has logarithmic lookup time, unless you really really need O(1) insertion and lookup (which you probably don't), it scales very well. I know that you say that "many many" names will be in this data structure, but I'm guessing that will be at most a few thousands; won't it? – yzt Mar 30 '13 at 06:40
  • @Arch1tect: Again, I very strongly suggest that you write a simple version using `std::map` and check whether you actually need something faster. – yzt Mar 30 '13 at 06:54

3 Answers3

4

A container that can insert in O(1) and give ordered iteration in O(n) could be used to sort strings in linear time.

We can immediately conclude that this container doesn't work with a comparator alone, since comparator-based sorts have a lower bound of O(n log n).

There are only a few sorting algorithms that sort in linear time, and they typically need knowledge of your key space to work. An incremental, "online" version of such an algorithm could be useful to you, and the incrementally-built internal data it uses would then become your container.

Here is a discussion of linear-time sort algorithms.

Community
  • 1
  • 1
phs
  • 10,687
  • 4
  • 58
  • 84
1

Considering that real O(1) operation in this case is a very complicated problem, and you prefer using the STL, I'd suggest something like the following data structures. Not that this does not satisfy your Big-Oh requirements, and it's not the most efficient way you can implement it even with the STL, but it's simple and efficient.

Your main data structure is a std::map. But in order to accelerate lookups to O(1), you can use an std::unordered_map like this:

using std::map, std::string, std::vector, std::unordered_map;

typedef map<string, vector<int> > TestTakersMap;
typedef unordered_map<string, TestTakersMap::iterator> LookupAccelerator;

Now your various operations will be:

  • Add a new person: You insert into the map, but also you add the name and the iterator in the map where you inserted the new record into the unordered_map. O(log(N)).
  • Look up a person: Using the unordered_map, you can get the iterator and the data for that person. Expected O(1).
  • Add a new score: You find the person using the unordered_map, then use the iterator you got to append a new score. Amortized O(1).
  • Printing out all names and scores: You iterator over the map itself and get the names in lexicographical order, with no added sort step. Can be considered O(S), where S is the total number of scores for all participants.

Note that in all this, your bottleneck will be your cache, and all this pointer chasing and jumping around in memory will not help you at all. Of course, this depends on several other factors, e.g., how many names you actually got, how many scores per person you have, how often you add a new person, how often you add a new score, how often you print out all the names and scores, how much more data you have and need per person and per test, etc.

UPDATE: You can do a basic operations like the following. The includes, etc. look like this:

#include <map>
#include <string>
#include <unordered_map>
#include <vector>

using std::map;
using std::string;
using std::unordered_map;
using std::vector;

This is a very simple class to do some of the operations you want. Note that I'm using C++11 features (auto, emplace, ...) but don't take this code as particularly good programming style; I can't vouch for that.

class TestScores
{
private:
    typedef int ScoreType;
    typedef vector<ScoreType> ScoreList;
    typedef map<string, ScoreList> TestTakersMap;
    typedef unordered_map<string, TestTakersMap::iterator> LookupAccelerator;

public:
    bool hasName (string const & new_name) const
    {
        return m_lookup.end() != m_lookup.find (new_name);
    }

    // Returns true if the name is really new
    bool addName (string const & new_name)
    {
        if (hasName(new_name))
            return false; // name already in there

        auto i = m_takers.emplace (new_name, vector<int>()).first;
        m_lookup.emplace (new_name, i);

        return true;
    }

    ScoreList const & getScores (string const & name) const
    {
        // This redirects to the private, non-const version
        return const_cast<TestScores *>(this)->getScores(name);
    }

    void addScore (string const & name, ScoreType new_score)
    {
        getScores(name).push_back (new_score);
    }

private:
    // If the name doesn't already exist, it is added!
    ScoreList & getScores (string const & name)
    {
        if (!hasName(name))
            addName (name);

        return m_lookup[name]->second;
    }

private:
    TestTakersMap m_takers;
    LookupAccelerator m_lookup;
};
yzt
  • 8,873
  • 1
  • 35
  • 44
  • Isn't insert still O(logN) ? what's the advantage here?? – Arch1tect Apr 05 '13 at 02:08
  • ` * (LookupAccelerator[name])[name].pushTesterScore(x) ` this is how I insert? – Arch1tect Apr 05 '13 at 02:10
  • I see... it speeds up the look up time after the person exist! Great idea!! Is this the right way to look up? ` (LookupAccelerator[name])[name].pushTesterScore(x)` – Arch1tect Apr 05 '13 at 02:16
  • never mind, I read something wrong... please show me how to write the lookup code in this implementation when you have time. – Arch1tect Apr 05 '13 at 02:56
  • @Arch1tect: I updated my answer to add sample usage. Does that clear things up? – yzt Apr 05 '13 at 13:02
  • yeah, thanks, originally I didn't know you need to set up bool etc to check existence and thought it's done automatically so I was confused. I shall compare the time difference of this implementation and a simple map when I have time. thanks again for answering/commenting many times on this question:) – Arch1tect Apr 05 '13 at 19:06
  • @Blinky: Regarding the edit you did on my code ( http://stackoverflow.com/revisions/15716065/2 , adding a space between `>>`), while I appreciate it, since the context of my answer is C++11, I thought to use all features of the language! Emmm... actually, is `std::unordered_map` a TR1 feature or a C++11 feature?! – yzt Apr 08 '13 at 19:02
0

If you are really serious about the size of your data-set being very large, and that you absolutely need efficient insertion, lookup and lexicographical iteration, you can check out Judy Arrays. Judy arrays are fast, memory-efficient and trie-like associative data structures.

You can check out these two implementations:

yzt
  • 8,873
  • 1
  • 35
  • 44
  • 1
    Meh if he's serious about this he should use a DBMS, there's plenty of free ones out there like MySQL that already implement prune trees indexes over large amounts of data. The truth of the matter is though that a simple list will be more than enough and he's micro-optimizing. – Blindy Mar 30 '13 at 17:23
  • Agreed. I figured it's some kind of homework or test. An application like this would absolutely need persistence of data too, and the most obvious, logical and efficient solution is using a DBMS. – yzt Mar 31 '13 at 11:29