
This is a follow-up question to this MIC question. When adding items to the vector of reference wrappers, I spend about 80% of the time inside the ++ operator, whatever iteration approach I choose.
The query works as follows:

VersionView getVersionData(int subdeliveryGroupId, int retargetingId,
                           const std::wstring &flightName) const {
  VersionView versions;
  // Collect the matching ranges for a 3x3 neighbourhood of the two ids.
  for (auto i = 0; i < 3; ++i) {
    for (auto j = 0; j < 3; ++j) {
      versions.insert(m_data.get<mvKey>().equal_range(
          boost::make_tuple(subdeliveryGroupId + i, retargetingId + j, flightName)));
    }
  }
  return versions;
}
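
For context, the container declaration is not shown here; the following is a minimal sketch of the kind of multi_index_container this query assumes (the member and tag names are guesses, not the real code):

#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>

namespace bmi = boost::multi_index;

struct VersionData {  // hypothetical; the real type is not shown
  int subdeliveryGroupId;
  int retargetingId;
  std::wstring flightName;
  // ... payload ...
};

struct mvKey {};  // index tag, as used by m_data.get<mvKey>()

using Storage = bmi::multi_index_container<
    VersionData,
    bmi::indexed_by<
        bmi::ordered_non_unique<  // a red-black tree under the hood
            bmi::tag<mvKey>,
            bmi::composite_key<
                VersionData,
                bmi::member<VersionData, int, &VersionData::subdeliveryGroupId>,
                bmi::member<VersionData, int, &VersionData::retargetingId>,
                bmi::member<VersionData, std::wstring, &VersionData::flightName>>>>>;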

I've tried the following ways to fill the vector of reference wrappers:

template <typename InputRange> void insert(const InputRange &rng) {
  // 1) base::insert(end(), rng.first, rng.second);                  // 12ms
  // 2) std::copy(rng.first, rng.second, std::back_inserter(*this)); // 6ms
  /* 3) // 12ms
     size_t start = size();
     auto tmp = std::reference_wrapper<const VersionData>(VersionData(0, 0, L""));
     resize(start + boost::size(rng), tmp);
     auto beg = rng.first;
     for (; beg != rng.second; ++beg, ++start) {
       this->operator[](start) = std::reference_wrapper<const VersionData>(*beg);
     }
  */
  std::copy(rng.first, rng.second, std::back_inserter(*this));
}

Whatever I do, I pay for operator++ or for the size method, which itself just increments the iterator, so I'm still stuck in ++. So the question is whether there is a way to iterate over the result ranges faster. If there is no such way, is it worth going down into the implementation of equal_range and adding a new argument holding a reference to the container of reference_wrappers, to be filled with results instead of creating a range?

EDIT 1: sample code http://coliru.stacked-crooked.com/a/8b82857d302e4a06/
Due to this bug, it will not compile on Coliru.
EDIT 2: Call tree, with time spent in operator++:

(images: call tree and hotpath from the profiler)

EDIT 3: Some concrete stuff. First of all, I didn't start this thread just because operator++ takes too much of the overall execution time and I don't like it just "because"; at this very moment it is the major bottleneck in our performance tests. Each request is usually processed in hundreds of microseconds; requests similar to this one (they are somewhat more complex) take ~1000-1500 microseconds, which is still acceptable. The original problem was that once the number of items in the data structure grew to hundreds of thousands, performance deteriorated to something like 20 milliseconds. Now, after switching to MIC (which drastically improved the code's readability, maintainability and overall elegance), I can reach something like 13 milliseconds per request, of which 80%-90% is spent in operator++. Now the question is whether this can be improved somehow, or should I go looking for tar and feathers for myself? :)

kreuzerkrieg
    "If there is no such a way is it worth to try and go down the implementation of equal_range adding new argument which holds reference to the container of reference_wrapper which will be filled with results instead of creating range?" - what are you talking about? Of course there will be no difference if you copy the implementation. Also, "creating a range" is not a costly operation. Have you even profiled **what** in `operator++` is taking so much time? Likely, you're just looking at the cost of traversing a node-based data structure. – sehe Dec 21 '14 at 13:02
  • I'm going to say it again: if you want serious help, **at least** post the data structure declaration involved. Make the sample self-contained and include some data that exhibits the bad performance you are talking about. – sehe Dec 21 '14 at 13:03
  • You are completely right. I had to disable inlining completely to see the exact line in the profiler. It is the implementation of the 'increment' method in multi_index/detail/ord_index_node.hpp; the "while(x->left()!=pointer(0))x=x->left();" takes 40% of the time. – kreuzerkrieg Dec 21 '14 at 13:11
  • Will try to compose a self-contained test – kreuzerkrieg Dec 21 '14 at 13:12
  • MIC validates iterators in debug mode. Ensure you compile in release mode (with optimizations, without MIC [debugging macros](http://www.boost.org/doc/libs/1_57_0/libs/multi_index/doc/tutorial/debug.html)). – Igor R. Dec 21 '14 at 15:41
  • @sehe, sorry, took some time – kreuzerkrieg Dec 24 '14 at 13:14
  • @kreuzerkrieg Can you tell me what is taking too long? The exact code you posted results in `Checking getVersionData perf. 99999 entities retrieved. 2ms for all calulations.` with me. What is an acceptable range here? How can I use the sample to show bad performance? – sehe Dec 24 '14 at 23:01
  • Also, why are you using any_iterator if you care about performance? And why is there a flyweight introducing indirection, locking and tracking, when there is a much more likely gain in storing `wchar_t const*` instead of `wstring`? – sehe Dec 24 '14 at 23:27

4 Answers


The fact that 80% of getVersionData's execution time is spent in operator++ is not indicative of any performance problem per se; at most, it tells you that equal_range and std::reference_wrapper insertion are comparatively faster. Put another way, when you profile a piece of code you will typically find the locations where most of the time is spent, but whether this is a problem or not depends on the required overall performance.

Joaquín M López Muñoz

@kreuzerkrieg, your sample code does not exercise any kind of insertion into a vector of std::reference_wrappers! Instead, you're projecting the result of equal_range into a boost::any_range, which is expected to be fairly slow to iterate: basically, increment ops resolve to virtual calls.

So, unless I'm seriously missing something here, the sample code's performance, or lack thereof, does not have anything to do with whatever your problem is in the real code (assuming VersionView, of which you don't show the code, is not using boost::any_range).

That said, if you can afford to replace your ordered indices with equivalent hashed indices, iteration will probably be faster, but this is an utter shot in the dark, given that you're not showing the real stuff.
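
For illustration only, a hedged sketch of that swap, using the same hypothetical VersionData/mvKey names as the sketch in the question (not the real code):

#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>

namespace bmi = boost::multi_index;

struct VersionData {  // hypothetical, as sketched in the question above
  int subdeliveryGroupId;
  int retargetingId;
  std::wstring flightName;
};
struct mvKey {};

using HashedStorage = bmi::multi_index_container<
    VersionData,
    bmi::indexed_by<
        bmi::hashed_non_unique<  // hash buckets instead of a red-black tree
            bmi::tag<mvKey>,
            bmi::composite_key<
                VersionData,
                bmi::member<VersionData, int, &VersionData::subdeliveryGroupId>,
                bmi::member<VersionData, int, &VersionData::retargetingId>,
                bmi::member<VersionData, std::wstring, &VersionData::flightName>>>>>;

// equal_range keeps the same call shape; note that hashed indices only
// support lookup on the full composite key, which is what the query uses:
// auto rng = storage.get<mvKey>().equal_range(
//     boost::make_tuple(subdeliveryGroupId, retargetingId, flightName));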

Joaquín M López Muñoz
  • Now I do not insert into the vector of reference wrappers, and if I do, it just moves the problem from one place to another: getData (which I excluded from the measurements, since getting the equal_range is negligible) will take all the time, which will be spent iterating over the results and inserting into the vector. I will post this case too. – kreuzerkrieg Dec 25 '14 at 05:33
  • Here you go: http://coliru.stacked-crooked.com/a/7d44672d44e49602 (most of the time is spent in std::distance and operator++ when inserting into the vector) – kreuzerkrieg Dec 25 '14 at 06:27
  • BTW, I've checked hashed indices as well; they are twice as fast as ordered indices – kreuzerkrieg Dec 25 '14 at 06:30

I think that you're measuring the wrong things entirely. When I scale up from 3x3x11111 to 10x10x111111 (so 111x as many items in the index), it still runs in 290ms.

And populating the stuff takes orders of magnitude more time. Even deallocating the container appears to take more time.

What Doesn't Matter?

I've contributed a version with some trade-offs, which mainly shows that there's no sense in tweaking things: View On Coliru

  • there's a switch to avoid the any_range (it doesn't make sense to use that if you care about performance)
  • there's a switch to tweak the flyweight:

    #define USE_FLYWEIGHT 0 // 0: none 1: full 2: no tracking 3: no tracking no locking
    

    again, it merely shows you could easily do without it (the policy combinations behind this switch are sketched just after this list), and you should consider doing so unless you need the memory optimization for the string (?). If so, consider using the OPTIMIZE_ATOMS approach:

  • the OPTIMIZE_ATOMS switch basically does flyweight for wstring there. Since all the strings are repeated here, it will be mighty storage-efficient (although the implementation is quick and dirty and should be improved). The idea is applied much better here: How to improve performance of boost interval_map lookups
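
For reference, a minimal sketch of what those USE_FLYWEIGHT settings correspond to in Boost.Flyweight terms; the Atom aliases are invented names for illustration:

#include <string>
#include <boost/flyweight.hpp>
#include <boost/flyweight/no_tracking.hpp>
#include <boost/flyweight/no_locking.hpp>

// USE_FLYWEIGHT 1: full-featured flyweight (reference tracking + locking)
using Atom1 = boost::flyweight<std::wstring>;

// USE_FLYWEIGHT 2: no tracking (values stay in the factory forever)
using Atom2 = boost::flyweight<std::wstring, boost::flyweights::no_tracking>;

// USE_FLYWEIGHT 3: no tracking and no locking (only safe single-threaded)
using Atom3 = boost::flyweight<std::wstring, boost::flyweights::no_tracking,
                               boost::flyweights::no_locking>;

// All copies of the same string share one stored wstring, e.g.:
// Atom3 name(L"some flight name");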

Here are some rudimentary timings:

(image: table of rudimentary timings)

As you can see, basically nothing actually matters for query/iteration performance.

Any Iterators: Do They Matter?

It might be the culprit on your compiler. On mine (gcc 4.8.2) it wasn't anything big, but see the disassembly of the accumulate loop without the any iterator:

(image: disassembly of the accumulate loop without any_iterator)

As you can see from the sections I've highlighted, there doesn't seem to be much fat from the algorithm, the lambda, or the iterator traversal. Now, with the any_iterator the situation is much less clear, and if your compiler optimizes less well, I can imagine it failing to inline elementary operations, making iteration slow. (Just guessing a little now.)
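
To make the type-erasure point concrete, here is a small self-contained sketch (not from the answer; std::multiset stands in for the MIC ordered index, and the function names are invented) contrasting a type-erased any_range return with a concrete iterator_range return:

#include <cstddef>
#include <set>
#include <boost/range/any_range.hpp>
#include <boost/range/iterator_range.hpp>

// std::multiset stands in for the MIC ordered index here: both are
// node-based trees with bidirectional iterators.
using Index = std::multiset<int>;

// Type-erased range: every ++, * and == goes through a virtual call, so
// the compiler cannot inline the tree traversal.
using ErasedRange =
    boost::any_range<int, boost::bidirectional_traversal_tag, const int&,
                     std::ptrdiff_t>;

ErasedRange queryErased(const Index& idx, int key) {
  return boost::make_iterator_range(idx.equal_range(key));
}

// Concrete range: the real iterator type stays visible to the compiler,
// so ++ can be inlined at the call site.
boost::iterator_range<Index::const_iterator> queryConcrete(const Index& idx,
                                                           int key) {
  return boost::make_iterator_range(idx.equal_range(key));
}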

sehe
  • Impressive! I really appreciate your effort. Unfortunately, I've measured the right thing, since loading the data (and destruction) does not matter. In real life the data is loaded from the DB and stored in an LRU cache which is big enough for the entity not to be ejected, so it resides there happily until the entity is invalidated (the DB is updated, rare stuff) or the TTL expires (20 minutes), so the load-time issue is far less acute. In any case, it is nice to get better load time almost for free. – kreuzerkrieg Dec 25 '14 at 05:17
  • Just ran any_range vs iterator_range with 1,111,100 items as you did, and I got 50 and 49 microseconds for the std::accumulate. This is quite an annoying result, since according to the "performance" section any_range MUST be slower than anything else... Is my compiler to blame? Will try to find a Linux machine with GCC and test it there – kreuzerkrieg Dec 25 '14 at 05:39
  • As you can see in the timings table, there's not much difference with `any_range` in GCC either. That's the whole point: I just don't see a bottleneck (well, outside the realm of the expected). If MSVC doesn't blow it, then I suppose it's OK unless you can pinpoint a problem in the generated assembly. **GASP:** 50 microseconds? That's impressively fast. I got ~280 ms, not µs. Then again, I did use 11,111,100 (not 1,111,100 like you mentioned in the comment). Did you notice I "fixed" the capacity of the `res` and the initializer (`0ll`) to avoid undefined behaviour on integer overflow? – sehe Dec 25 '14 at 15:21
  • Hehehe. I bet it was something (a) not shown (b) related to compilation flags :/ Now I'm mighty curious – sehe Dec 25 '14 at 15:53
  • Nope, none of those; the solution is a horrible MIC abuse. And actually I don't get why you fail to see the problem. Is the problem not acute enough, or can you just not pinpoint it in your profiler because of the inlining, so it just shows the time spent elsewhere? Actually, that's why there were two cin.get()s in my test: I was starting the profiler paused and resuming profiling after the data was populated – kreuzerkrieg Dec 25 '14 at 16:05
  • How can we see the problem if you fail to describe it? What do you want (other than more sensible call graphs from your profiler)? Do you just want better performance? Then we need to know your use case(s). Because 11,111,100 items all with identical m_key aren't likely it, I suppose. As posed, the sample appears to exhibit reasonable performance ("reasonable" being ad hoc defined as showing no obvious unexpected bottlenecks). Of course, you can make different trade-offs, but I think it'd be rather lame if I went and suggested a flat map instead of your MIC. I have no grounds to base that on. – sehe Dec 25 '14 at 16:15
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/67673/discussion-between-kreuzerkrieg-and-sehe). – kreuzerkrieg Dec 25 '14 at 16:19

OK, so the solution I applied is as follows: in addition to the ordered_non_unique index (the 'byKey' one) I've added a random_access index. When the data is loaded, I rearrange the random index with m_data.get<byKey>().begin(). Then, when the MIC is queried for data, I just do boost::equal_range on the random index with a custom predicate that emulates the same logic used for ordering the 'byKey' index. That's it: it gives me a fast size() (O(1), as I understand it) and fast traversal. Now I'm ready for your rotten tomatoes :)

EDIT 1: of course, I've changed the any_range from the bidirectional traversal tag to the random-access one
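
A minimal sketch of this arrangement, under stated assumptions: a single int member stands in for the real composite key, and the tag names (byKey, byRandom) plus the arrange/query helpers are invented for illustration. rearrange is the documented way to mirror another index's order into a random_access index.

#include <utility>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/random_access_index.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/range/algorithm/equal_range.hpp>

namespace bmi = boost::multi_index;

struct VersionData { int key; /* ... */ };

struct byKey {};
struct byRandom {};

using Data = bmi::multi_index_container<
    VersionData,
    bmi::indexed_by<
        bmi::ordered_non_unique<
            bmi::tag<byKey>, bmi::member<VersionData, int, &VersionData::key>>,
        bmi::random_access<bmi::tag<byRandom>>>>;

// After loading, mirror the byKey ordering into the random-access index.
void arrange(Data& d) {
  d.get<byRandom>().rearrange(d.get<byKey>().begin());
}

// Query by binary-searching the random-access index with a predicate that
// emulates the byKey ordering; the resulting iterators are random access,
// so size() is O(1) and ++ is a pointer bump instead of a tree walk.
using RndIt = Data::index<byRandom>::type::const_iterator;

std::pair<RndIt, RndIt> query(const Data& d, int key) {
  return boost::equal_range(
      d.get<byRandom>(), VersionData{key},
      [](const VersionData& a, const VersionData& b) { return a.key < b.key; });
}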

kreuzerkrieg