5

We have 48,16,703 entries in this format.

1 abc
2 def
...
...
4816702 blah
4816703 blah_blah

Since the number of entries are quite big, I am worried that std::map would take much time during insertion since it need to do the balancing as well for each insertion.

Only inserting these entries into the map takes a lot of time. I am doing

map[first] = second;

Two questions: 1. Am I correct in using std::map for these kind of cases? 2. Am I correct in inserting like the above way. OR Should I use map.insert()

I am sorry for not doing the experiments and writing the absolute numbers but we want an general consensus if we are doing the right thing or not.

Also, they keys are not consecutive always..

P.S. Of-course, later we will need to access that map as well to get the values corresponding to the keys.

gsamaras
  • 71,951
  • 46
  • 188
  • 305
Rahul Bhargava
  • 528
  • 1
  • 6
  • 17
  • 6
    If the keys are just a consecutive sequence of integers you could just use an array or a vector – Tharwen Dec 10 '18 at 12:23
  • 1
    @Tharwen, Thank you for quick answer but they are not consecutive. That is the bad part. Edited the question. – Rahul Bhargava Dec 10 '18 at 12:25
  • 2
    And, if not, `std::unordered_map` should be more appropriate, since its insertion average time complexity is constant. – Daniel Langr Dec 10 '18 at 12:25
  • 2
    Even if the numbers are not consecutive, but there are a lot of entries, e.g. more than half of the largest index, and you care more about speed than size, it might still make more sense to use an array or a vector, which have o(1) insertion and retrieval. – Rotem Dec 10 '18 at 12:28
  • @Rotem, Thank you. I understand that. What if there are holes in the format. I mean there are indexes missing? – Rahul Bhargava Dec 10 '18 at 12:30
  • @RahulBhargava Just don't put anything in that index then? – Rotem Dec 10 '18 at 12:31
  • 1
    Recommend to analyse the performance first using http://valgrind.org/docs/manual/cl-manual.html. If you do not want to iterator in sorted order, use `std::unordered_map` – PraAnj Dec 10 '18 at 12:31
  • Can you have empty strings? If not, you can indicate void entries by using empty strings. – Matthieu Brucher Dec 10 '18 at 12:31
  • @MatthieuBrucher Another option, though less convenient, is to hold a separate `std::vector` indicating which indices contain a value. – Rotem Dec 10 '18 at 12:34
  • @Rotem, If I do not put anything in the index, how can I pull that value information back later in the program? – Rahul Bhargava Dec 10 '18 at 12:35
  • 1
    @Rotem that would harm cache locality though.. – gsamaras Dec 10 '18 at 12:35
  • @RahulBhargava Not sure what you mean. I'm assuming the value type is `std::string`, which is default initialized to an empty string. But why would you need to pull values from indices which don't exist in your input? – Rotem Dec 10 '18 at 12:36
  • @gsamaras True but possibly still better than maps overall. Guessing is futile, requires benchmarking. The question doesn't indicate how data is used later. Is it iterated in order? Is it randomly retrieved? If the latter, cache locality will be irrelevant anyways. – Rotem Dec 10 '18 at 12:37
  • 2
    If your sequence of keys is already sorted, try `map.insert(map.end(), {first, second})`. [Demo](https://ideone.com/wrgAkg). It will be a win even if they are *mostly* consecutive. – n. m. could be an AI Dec 10 '18 at 12:39
  • @Rotem, My point is that I will need to access that key-value relationship later. Yes, value are strings. – Rahul Bhargava Dec 10 '18 at 12:42
  • @RahulBhargava Sorry, I'm still not seeing the problem with that. – Rotem Dec 10 '18 at 12:43
  • @Rotem, So you are saying that : Create an vector, iterate over each line, put the value at corresponding key location? Did I get that correctly? And the keys which do not exists will have empty value. – Rahul Bhargava Dec 10 '18 at 12:47
  • @RahulBhargava Yes, pretty much. Of course you should pre-size your vector to the size of the largest index + 1. Do you see an issue with this? – Rotem Dec 10 '18 at 12:48
  • @Rotem, Actually that make sense to me. Please give me some time. I will think over it and get back to you. Thanks much. – Rahul Bhargava Dec 10 '18 at 12:49
  • @Rotem that's a degenerate hash map (with hash == identity and no collisions) – n. m. could be an AI Dec 10 '18 at 16:02
  • @n.m. What are you referring to? All arrays are hash maps by that definition. – Rotem Dec 10 '18 at 16:03
  • @Rotem you can definitely view all arrays this way, but an array with lots of "unused" entries mears more resemblance to the general hash map than, say, a matrix. – n. m. could be an AI Dec 10 '18 at 16:19
  • @Rotem, The keys could be anything. Not necessarily start from 0/1. They could start from 1 million as well. Basically this key is just an reference that if I want to use string 'abc' later, use 1 instead and so on. – Rahul Bhargava Dec 11 '18 at 06:37
  • @RahulBhargava It's still doable if you know the lowest index, then you can store a constant 'offset' to be applied between key and array index. But unless speed is a critical factor here, the other solutions are starting to look simpler and more practical. – Rotem Dec 11 '18 at 08:12

2 Answers2

7

If you don’t need to insert into the map afterwards, you can construct an unsorted vector of your data, sort it according to the key, and then search using functions like std::equal_range.
It’s the same complexity as std::map, but far less allocations.

Daniel
  • 30,896
  • 18
  • 85
  • 139
4

Use an std::unordered_map, which has much better insertion time complexity than std::map, as the reference mentions:

Complexity

Single element insertions:
    Average case: constant.
    Worst case: linear in container size.

Multiple elements insertion:
    Average case: linear in the number of elements inserted.
    Worst case: N*(size+1): number of elements inserted times the container size plus one.

May trigger a rehash (not included in the complexity above).

That's better than the logarithmic time complexity of std::map's insertion.

Note: std::map's insertion can enjoy "amortized constant if a hint is given and the position given is the optimal.". If that's the case for you, then use a map (if a vector is not applicable).

@n.m. provides a representative Live demo

gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • Except if they keep on rehashing. – Matthieu Brucher Dec 10 '18 at 12:30
  • 1
    Another considerable advantage is possibly lower memory overhead. For instance, in libstdc++ `std::map` is implemented as a red-black tree. Each stored elements has 32 bytes of memory overhead (64bit arch - left/right/parent node pointer and color flag). For hash table, likely one or two pointers are used instead. – Daniel Langr Dec 10 '18 at 12:31
  • Yeah Matthieu, that's why the OP should benchmark, since for his application, it might be a problem, but I think it won't. @DanielLangr indeed, although I couldn't find [any support](https://stackoverflow.com/questions/31112852/how-stdunordered-map-is-implemented) of that, that's why I am not including it in my answer. – gsamaras Dec 10 '18 at 12:34
  • 1
    @gsamaras Sure, it's implementation defined. Anyone can explore source code of libstdc++ or libc++ or Microsoft's implementation to find out more details. – Daniel Langr Dec 10 '18 at 12:37
  • @n.m. I was thinking about that too... Updated my answer, thanks! I borrowed your demo, hope you don't mind. :) – gsamaras Dec 10 '18 at 12:49