
I am working on a data structure where the input is very large, almost 1 TB. I need to load the data into an associative container.

The data has some duplicate entries, so I am using a multimap, but someone suggested using a map of vectors instead. What is the difference performance-wise?

 multimap<const char*, const char*, cmptr> mulmap;

 map<const char*, vector<const char*>, cmptr> mmap;
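For reference, the comparator `cmptr` is used above but never shown. A minimal sketch of one possible definition, assuming a `strcmp`-based ordering (some comparison predicate is required for `const char*` keys, as the comments below point out):

    #include <cstring>
    #include <map>
    #include <vector>

    using namespace std;

    // Hypothetical comparator: orders C strings by content, not by pointer value.
    struct cmptr {
        bool operator()(const char* a, const char* b) const {
            return strcmp(a, b) < 0;  // strict weak ordering over C strings
        }
    };

    multimap<const char*, const char*, cmptr> mulmap;
    map<const char*, vector<const char*>, cmptr> mmap;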
Manish
  • Why don't you try them both, and find out? – Oliver Charlesworth Feb 18 '13 at 08:20
  • Actually, in both cases your performance is likely to be dominated by disk speed, unless you have a system with 1 TB of RAM... – Oliver Charlesworth Feb 18 '13 at 08:24
  • If you want to use `const char*` as the key, you must also supply a comparison predicate for this to work. It would be easier to use `std::map<std::string, std::string>` instead. – Olaf Dietsche Feb 18 '13 at 08:29
  • Please. I beg you. Don't use `const char*` if you mean to use it as a string. Use `std::string` instead. Just... please! – Mark Garcia Feb 18 '13 at 08:30
  • What problem are you trying to solve? – Peter Wood Feb 18 '13 at 08:32
  • @MarkGarcia Why beg? He/she has a 1 TB structure, and if the overhead of a `std::string` can be cut down to a single pointer, with the string data stored elsewhere, that is a real win. In normal use `std::string` is much better than `const char*`, but that is not always the case. `map` and `multimap` both take a comparator that lets `const char*` work as a key, and even `unordered_map` and `unordered_multimap` take a hash function as a template parameter, so this works perfectly well in C++. – BigBoss Feb 18 '13 at 08:33
  • I need to store the contents of 1 million files in a map/multimap and apply my algorithm to process the data and print the output. – Manish Feb 18 '13 at 08:33
  • http://stackoverflow.com/questions/7351153/multimap-vs-map-with-set – Pheonix Feb 18 '13 at 08:50
  • @user15662 Why does your algorithm require all the data to be in memory at the same time? – Peter Wood Feb 18 '13 at 08:59
  • Unless the keys are always string literals, you'd probably be better off using `std::string`; otherwise, you're going to have lifetime issues. And if you're reading the data all in one go and then processing it, you'd probably be better off with a `std::vector<std::pair<std::string, std::string>>`, sorting the vector once after reading all the data (see the sketch after these comments). `std::vector` will improve locality enormously, which will likely make a significant difference here. – James Kanze Feb 18 '13 at 09:10
  • http://stackoverflow.com/questions/8602068/whats-the-difference-between-stdmultimapkey-value-and-stdmapkey-stds – Ciro Santilli OurBigBook.com Dec 31 '16 at 22:03
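A minimal sketch of the read-then-sort approach from James Kanze's comment (the data shown is illustrative):

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::pair<std::string, std::string>> entries;

        // ... read all (key, value) pairs in one go ...
        entries.emplace_back("fruit", "pear");
        entries.emplace_back("fruit", "apple");

        // Sort once after loading; duplicate keys become adjacent runs,
        // and the contiguous layout has far better locality than a node-based tree.
        std::sort(entries.begin(), entries.end());

        // Process each run of equal keys in a single pass.
        for (std::size_t i = 0; i < entries.size();) {
            std::size_t j = i;
            while (j < entries.size() && entries[j].first == entries[i].first) ++j;
            // entries[i..j) all share one key
            i = j;
        }
    }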

2 Answers


You are wasting your time thinking about map versus multimap. Suppose that the number of bins is N and the average number of items per bin is M.

A std::multimap<Key, Val> typically uses an RB tree with duplicate keys.

  • Fetch is O(log N + log M)
  • Insert is O(log N + log M)
  • Delete is O(log N + log M)
  • Iteration is O(1)

A std::map<Key, std::vector<Val>> typically uses an RB tree with unique keys.

  • Fetch is O(log N)
  • Insert is O(log N)
  • Delete is O(log N)
  • Iteration is O(1)

As you can see, the difference is not worth talking about unless M is very large.
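To make the comparison concrete, a minimal sketch of the same duplicate-key data in both layouts (`std::string` keys for brevity):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        std::multimap<std::string, std::string> mulmap;
        std::map<std::string, std::vector<std::string>> mmap;

        // The same duplicate key in each layout.
        mulmap.insert({"fruit", "apple"});
        mulmap.insert({"fruit", "pear"});   // a second tree node
        mmap["fruit"].push_back("apple");   // one tree node...
        mmap["fruit"].push_back("pear");    // ...values stored contiguously

        // Fetch, O(log N + log M): walk the range of duplicates in the tree.
        auto range = mulmap.equal_range("fruit");
        for (auto it = range.first; it != range.second; ++it)
            std::cout << it->second << '\n';

        // Fetch, O(log N): find the bin, then iterate the vector.
        for (const auto& v : mmap["fruit"])
            std::cout << v << '\n';
    }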

However, both are limited by available RAM. Holding 1 TB in memory is simply not feasible for most systems; no motherboard I've heard of supports that much.

You are better off using a database for 1 TB of data. You can choose almost any database for this task. Kyoto Cabinet is simple and does what you want, but you can also use PostgreSQL, MySQL, SQLite, Dynamo, Redis, MongoDB, Cassandra, Voldemort...
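A minimal sketch with Kyoto Cabinet's C++ API, assuming the library is installed (link with -lkyotocabinet); the file name and the delimiter used to keep duplicate values under one key are illustrative choices, not part of the library:

    #include <kctreedb.h>
    #include <iostream>
    #include <string>

    int main() {
        // TreeDB is Kyoto Cabinet's on-disk B+ tree database.
        kyotocabinet::TreeDB db;
        if (!db.open("entries.kct",
                     kyotocabinet::TreeDB::OWRITER | kyotocabinet::TreeDB::OCREATE)) {
            std::cerr << "open error: " << db.error().name() << '\n';
            return 1;
        }

        db.set("fruit", "apple");     // insert or overwrite
        db.append("fruit", "|pear");  // one way to keep duplicates under a key

        std::string value;
        if (db.get("fruit", &value))
            std::cout << value << '\n';  // "apple|pear"

        db.close();
    }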

Dietrich Epp
  • For "1 TB of data", one actually needs to be careful about which database and how the data is accessed... there be restrictions and such. – Feb 18 '13 at 08:31
  • @user15662: I still recommend Kyoto Cabinet over `std::` anything for this kind of job. – Dietrich Epp Feb 18 '13 at 08:32
  • +1 @user15662 then having a rock-solid DB backend should be a given. I concur with Dietrich here. – WhozCraig Feb 18 '13 at 08:33
  • For those of us with merely human levels of RAM, could we expect the performance of multimap to be roughly equivalent to map with vector? – Carbon Dec 01 '17 at 19:25
  • @Carbon: Depends on what you want to do with the data. In general, no. But vector is often the right choice. – Dietrich Epp Dec 01 '17 at 19:35

With 1 TB of input, I would use neither. Most likely, you don't have enough RAM. Use an on-disk data structure such as a B-tree.
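A minimal sketch of that idea using SQLite's C API (mentioned in the other answer; SQLite stores its tables and indexes as B-trees). The file, table, and data names are illustrative:

    #include <sqlite3.h>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("entries.db", &db) != SQLITE_OK) return 1;

        // A non-unique index on k gives multimap-style lookups while only the
        // touched B-tree pages are pulled into RAM.
        sqlite3_exec(db,
                     "CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT);"
                     "CREATE INDEX IF NOT EXISTS kv_k ON kv(k);",
                     nullptr, nullptr, nullptr);

        sqlite3_stmt* ins = nullptr;
        sqlite3_prepare_v2(db, "INSERT INTO kv VALUES (?, ?)", -1, &ins, nullptr);
        sqlite3_bind_text(ins, 1, "fruit", -1, SQLITE_STATIC);
        sqlite3_bind_text(ins, 2, "apple", -1, SQLITE_STATIC);
        sqlite3_step(ins);
        sqlite3_finalize(ins);

        sqlite3_close(db);
    }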

Olaf Dietsche