How to improve performance of boost interval_map lookups

Question

I'm using a boost::icl::interval_map to map byte ranges to a set of strings. The map is loaded from a (sorted) disk file, and then I do lookups using the code below.

The problem is the lookups are really slow.

In my test, I inserted 66425 ranges into the map. I profiled the code and basically > 99.9% of the time is spent in various Boost functions (there's not a particular function that is slow, there's a lot of time spread over many functions).

What can be done to make this faster?

Is my tree unbalanced (how do I find out? how can I rebalance it?)

Is using set<string> a problem?

Is calculating the intersection of the map and the window a problem? (Although it's what I need, so I can't see how else to do it).

using namespace std;
typedef set<string> objects;
typedef boost::icl::interval_map<uint64_t, objects> ranges;
void
find_range (const ranges *map, uint64_t start, uint64_t end,
            void (*f) (uint64_t start, uint64_t end, const char *object,
                       void *opaque),
            void *opaque)
{
  boost::icl::interval<uint64_t>::type window;
  window = boost::icl::interval<uint64_t>::right_open (start, end);

  ranges r = *map & window;

  ranges::iterator iter = r.begin ();
  while (iter != r.end ()) {
    boost::icl::interval<uint64_t>::type range = iter->first;
    uint64_t start = range.lower ();
    uint64_t end = range.upper ();

    objects obj_set = iter->second;
    objects::iterator iter2 = obj_set.begin ();
    while (iter2 != obj_set.end ()) {
      f (start, end, iter2->c_str (), opaque);
      iter2++;
    }
    iter++;
  }
}

The first few profile entries:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
  9.77      0.13     0.13 21866814     0.01           boost::icl::interval_bounds::interval_bounds(unsigned char)
  6.02      0.21     0.08  9132134     0.01           boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::lower(boost::icl::discrete_interval<unsigned long, std::less> const&)
  6.02      0.29     0.08  6004967     0.01           boost::enable_if<boost::icl::is_discrete_interval<boost::icl::discrete_interval<unsigned long, std::less> >, bool>::type boost::icl::is_empty<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::discrete_interval<unsigned long, std::less> const&)
  5.26      0.36     0.07 21210093     0.00           boost::icl::discrete_interval<unsigned long, std::less>::bounds() const
  5.26      0.43     0.07 11964109     0.01           std::less<unsigned long>::operator()(unsigned long const&, unsigned long const&) const
  4.51      0.49     0.06 35761849     0.00           std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Identity<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_S_left(std::_Rb_tree_node_base const*)
  4.51      0.55     0.06 12009934     0.00           boost::icl::operator==(boost::icl::interval_bounds, boost::icl::interval_bounds)
  3.76      0.60     0.05 12078493     0.00           boost::icl::discrete_interval<unsigned long, std::less>::upper() const
  3.76      0.65     0.05 12077959     0.00           boost::enable_if<boost::icl::is_interval<boost::icl::discrete_interval<unsigned long, std::less> >, boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type>::type boost::icl::upper<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::discrete_interval<unsigned long, std::less> const&)
  3.76      0.70     0.05  8837475     0.01           boost::icl::interval_bounds::bits() const
  3.76      0.75     0.05  6004967     0.01           boost::enable_if<boost::icl::is_interval<boost::icl::discrete_interval<unsigned long, std::less> >, bool>::type boost::icl::domain_less_equal<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type const&, boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type const&)
  3.01      0.79     0.04  5891650     0.01           boost::icl::is_right_closed(boost::icl::interval_bounds)

Data set: http://oirase.annexia.org/tmp/bmap.txt
Full code: http://git.annexia.org/?p=virt-bmap.git;a=tree

If you want serious help here, you'll need to post a selfcontained sample with a dataset that exposes the performance problem — sehe, Nov 26 '14 at 15:46
What does this code do for you? `boost::icl::interval::type range = iter->first;` make a range of out of a set? — Surt, Nov 26 '14 at 16:31
The intersection (`map & window`) creates another interval_map (`r`). The iteration iterates over this map. `iter->first` returns the first part (== key == `uint64_t` interval) of the iterator. — Rich, Nov 26 '14 at 16:40

sehe · Accepted Answer · 2014-11-26T23:48:27.527

In this answer I present three optimizations:

replacing the objects std::set by boost::container::flat_set for improved locality (and likely reallocation costs, since most object sets are <4 elements)

UPDATE In my final version below, simply replacing boost::container::flat_map back with std::set degraded performance of just find_range by a factor ~2x and find_range_ex by a factor of ~4x on my test system
replacing the object id std::string by string_atom (which is technically a char const* but logically unique). This technique is similar to interned strings in various VM implementations (like Java/.NET).

NOTE: The current implementation of make_atom is exceedingly simplistic and never frees atoms. You would potentially want to back the strings in a deque, use Boost Flyweights, a pool allocator or some combination of those to make it smarter. However, whether this is required depends on your use cases
replacing the map intersection with a equal_range call, which saves the bulk of time by avoiding copying (large amounts of) data

_UPDATE When applying just this optimization in isolation the speed up is already 26~30x. Note that the memory usage is roughly double at ~20MiB compared to when including all three optimizations_

Volume and data efficiency

Before I start, I like to know what the data looks like. So, writing some code to parse that bmap.txt input, we get:

Source On Coliru

Parsed ok
Histogram of 66425 input lines
d: 3367
f: 20613
p: 21222
v: 21223
ranges size:            6442450944
ranges iterative size:  21223
Min object set:         1.000000
Max object set:         234.000000
Average object set:     3.129859
Min interval width:     1024.000000
Max interval width:     2526265344.000000
Average interval width: 296.445177k
First:                  [0,1048576)
Last:                   [3916185600,6442450944)
String atoms:           23904 unique in 66425 total
Atom efficiency:        35.986451%

As you can see the sets are usually ~3 items, and many are duplicated.

Using the make_atom object naming method with boost::flat_set reduced memory allocation from ~15GiB to ~10Gib.

This optimization also reduces string comparison to pointer comparison for set insertion and the Combiner strategy of the interval_map, so for larger data sets this has the potential to have a lot of speedup.

Query efficiency

Query performance is indeed severely crippled by the partial copy of the input.

Don't copy, instead view the overlapping range, simply by replacing:

  const ranges r = *map & window;
  ranges::const_iterator iter = r.begin ();
  while (iter != r.end ()) {

with

  auto r = map->equal_range(window);
  ranges::const_iterator iter = r.first;
  while (iter != r.second) {

On my system running 10000 identical randomized queries with both versions results in a speedup of 32x:

10000 'random' OLD lookups resulted in 156729884 callbacks in 29148ms
10000 'random' NEW lookups resulted in 156729884 callbacks in 897ms

real    0m31.715s
user    0m31.664s
sys 0m0.012s

The runtime is now dominated by the parsing/statistics. Full benchmark code is here: On Coliru

#define BOOST_RESULT_OF_USE_DECTYPE
#define BOOST_SPIRIT_USE_PHOENIX_V3

/* virt-bmap examiner plugin
 * Copyright (C) 2014 Red Hat Inc.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <assert.h>

#include <boost/icl/interval.hpp>
#include <boost/icl/interval_set.hpp>
#include <boost/icl/interval_map.hpp>
#include <boost/container/flat_set.hpp>

using namespace std;

/* Maps intervals (uint64_t, uint64_t) to a set of strings, where each
 * string represents an object that covers that range.
 */

static size_t atoms_requested = 0;
static size_t atoms_unique_created = 0;

using string_atom = const char*;
string_atom make_atom(std::string&& s)
{
    static std::set<std::string> s_atoms;
    atoms_requested += 1;

    auto it = s_atoms.find(s);
    if (it != s_atoms.end())
        return it->c_str();

    atoms_unique_created += 1;
    return s_atoms.insert(std::move(s)).first->c_str();
}

typedef boost::container::flat_set<string_atom> objects;
typedef boost::icl::interval_map<uint64_t, objects> ranges;

ranges*
new_ranges (void)
{
  return new ranges ();
}

void
free_ranges (ranges *mapv)
{
  ranges *map = (ranges *) mapv;
  delete map;
}

extern "C" void
insert_range (void *mapv, uint64_t start, uint64_t end, const char *object)
{
  ranges *map = (ranges *) mapv;
  objects obj_set;
  obj_set.insert (obj_set.end(), object);
  map->add (std::make_pair (boost::icl::interval<uint64_t>::right_open (start, end), // SEHE added std::
                       obj_set));
}

extern "C" void
iter_range (void *mapv, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
  ranges *map = (ranges *) mapv;
  ranges::iterator iter = map->begin ();
  while (iter != map->end ()) {
    boost::icl::interval<uint64_t>::type range = iter->first;
    uint64_t start = range.lower ();
    uint64_t end = range.upper ();

    objects obj_set = iter->second;
    objects::iterator iter2 = obj_set.begin ();
    while (iter2 != obj_set.end ()) {
      f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
      iter2++;
    }
    iter++;
  }
}

extern "C" void
find_range (void const *mapv, uint64_t start, uint64_t end, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
  const ranges *map = (const ranges *) mapv;

  boost::icl::interval<uint64_t>::type window;
  window = boost::icl::interval<uint64_t>::right_open (start, end);

  const ranges r = *map & window;

  ranges::const_iterator iter = r.begin ();
  while (iter != r.end ()) {
    boost::icl::interval<uint64_t>::type range = iter->first;
    uint64_t start = range.lower ();
    uint64_t end = range.upper ();

    objects obj_set = iter->second;
    objects::iterator iter2 = obj_set.begin ();
    while (iter2 != obj_set.end ()) {
      f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
      iter2++;
    }
    iter++;
  }
}

extern "C" void
find_range_ex (void const *mapv, uint64_t start, uint64_t end, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
  const ranges *map = (const ranges *) mapv;

  boost::icl::interval<uint64_t>::type window;
  window = boost::icl::interval<uint64_t>::right_open (start, end);

#if 0
  const ranges r = *map & window;
  ranges::const_iterator iter = r.begin ();
  while (iter != r.end ()) {
#else
  auto r = map->equal_range(window);
  ranges::const_iterator iter = r.first;
  while (iter != r.second) {
#endif

    boost::icl::interval<uint64_t>::type range = iter->first;
    uint64_t start = range.lower ();
    uint64_t end = range.upper ();

    objects obj_set = iter->second;
    objects::iterator iter2 = obj_set.begin ();
    while (iter2 != obj_set.end ()) {
      f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
      iter2++;
    }
    iter++;
  }
}

#include <memory>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>
#include <fstream>
#include <chrono>

std::map<char, size_t> histo;

bool insert_line_of_input(ranges& bmap_data, uint64_t b, uint64_t e, char type, std::string& object) {
    if (!object.empty())
        histo[type]++;
    //std::cout << std::hex << b << " " << e << " " << type << " " << object << "\n";

#if 0
    object.insert(object.begin(), ':');
    object.insert(object.begin(), type);
#endif
    insert_range(&bmap_data, b, e, make_atom(std::move(object)));
    return true;
}

std::vector<std::pair<uint64_t, uint64_t> > generate_test_queries(ranges const& bmap_data, size_t n) {
    std::vector<std::pair<uint64_t, uint64_t> > queries;
    queries.reserve(n);

    for (size_t i = 0; i < n; ++i)
    {
        auto start = (static_cast<uint64_t>(rand()) * rand()) % bmap_data.size();
        auto end   = start + rand();

        queries.emplace_back(start,end);
    }

    return queries;
}

ranges read_mapfile(const char* fname) {
    std::ifstream ifs(fname);
    boost::spirit::istream_iterator f(ifs >> std::noskipws), l;

    ranges bmap_data;

    namespace phx = boost::phoenix;
    using namespace boost::spirit::qi;
    uint_parser<uint64_t, 16> offset;
    if (!phrase_parse(f,l,
                ("1 " >> offset >> offset >> char_("pvdf") >> as_string[lexeme[+graph] >> attr('/') >> lexeme[*~char_("\r\n")]]) 
                [ _pass = phx::bind(insert_line_of_input, phx::ref(bmap_data), _1, _2, _3, _4) ] % eol >> *eol, 
                blank))
    {
        exit(255);
    } else
    {
        std::cout << "Parsed ok\n";
    }

    if (f!=l)
        std::cout << "Warning: remaining input '" << std::string(f,l) << "\n";

    return bmap_data;
}

void report_statistics(ranges const& bmap_data) {
    size_t total = 0;
    for (auto e : histo) total += e.second;

    std::cout << "Histogram of " << total << " input lines\n";

    for (auto e : histo)
        std::cout << e.first << ": " << e.second << "\n";

    namespace ba = boost::accumulators;
    ba::accumulator_set<double, ba::stats<ba::tag::mean, ba::tag::max, ba::tag::min> > 
        object_sets, interval_widths;

    for (auto const& r : bmap_data)
    {
        auto width = r.first.upper() - r.first.lower();
        assert(width % 1024 == 0);

        interval_widths(width);
        object_sets(r.second.size());
    }

    std::cout << std::fixed;
    std::cout << "ranges size:            " << bmap_data.size()                 << "\n";
    std::cout << "ranges iterative size:  " << bmap_data.iterative_size()       << "\n";

    std::cout << "Min object set:         " << ba::min(object_sets)             << "\n" ;
    std::cout << "Max object set:         " << ba::max(object_sets)             << "\n" ;
    std::cout << "Average object set:     " << ba::mean(object_sets)            << "\n" ;
    std::cout << "Min interval width:     " << ba::min(interval_widths)         << "\n" ;
    std::cout << "Max interval width:     " << ba::max(interval_widths)         << "\n" ;
    std::cout << "Average interval width: " << ba::mean(interval_widths)/1024.0 << "k\n" ;
    std::cout << "First:                  " << bmap_data.begin()->first         << "\n" ;
    std::cout << "Last:                   " << bmap_data.rbegin()->first        << "\n" ;

    std::cout << "String atoms:           " << atoms_unique_created << " unique in " << atoms_requested << " total\n";
    std::cout << "Atom efficiency:        " << (atoms_unique_created*100.0/atoms_requested) << "%\n";
}

void perform_comparative_benchmarks(ranges const& bmap_data, size_t number_of_queries) {
    srand(42);
    auto const queries = generate_test_queries(bmap_data, number_of_queries);

    using hrc = std::chrono::high_resolution_clock;
    {
        auto start = hrc::now();
        size_t callbacks = 0;

        for (auto const& q: queries)
        {
            find_range(&bmap_data, q.first, q.second, 
                    [](uint64_t start, uint64_t end, const char *object, void *opaque) {
                    ++(*static_cast<size_t*>(opaque));
                    }, &callbacks);
        }
        std::cout << number_of_queries << " 'random' OLD lookups resulted in " << callbacks 
                  << " callbacks in " << std::chrono::duration_cast<std::chrono::milliseconds>((hrc::now()-start)).count() << "ms\n";
    }

    {
        auto start = hrc::now();
        size_t callbacks = 0;

        for (auto const& q: queries)
        {
            find_range_ex(&bmap_data, q.first, q.second, 
                    [](uint64_t start, uint64_t end, const char *object, void *opaque) {
                    ++(*static_cast<size_t*>(opaque));
                    }, &callbacks);
        }
        std::cout << number_of_queries << " 'random' NEW lookups resulted in " << callbacks 
                  << " callbacks in " << std::chrono::duration_cast<std::chrono::milliseconds>((hrc::now()-start)).count() << "ms\n";
    }
}

int main() {
    auto bmap = read_mapfile("bmap.txt");

    report_statistics(bmap);

    perform_comparative_benchmarks(bmap, 1000);

#if 0 // to dump ranges to console
    for (auto const& r : bmap)
    {
        std::cout << r.first << "\t" << r.second.size() << "\t";
        std::copy(r.second.begin(), r.second.end(), std::ostream_iterator<std::string>(std::cout, "\t"));
        std::cout << "\n";
    }
#endif
}

@Chris I've committed the changes related to the various optimizations, and the benchmark driver code to a **[public github branch](https://github.com/sehe/virt-bmap/commits/q27152834)** so it's easy to review them. The makefile is included as `q27152834.mak` — sehe, Nov 27 '14 at 00:29
Use `std::unordered_set` in `make_atom` for another order of magnitude speed up for that function according to [this](http://tinodidriksen.com/2010/04/02/cpp-set-performance/). — Surt, Nov 27 '14 at 16:14
@Surt all depends. Anyways, it's hardly about optimizing `make_atom` anyways, it's irrelevant to program runtime, and certainly not relevant to `find_rages_ times :) (you'd better think about whether you need to fix it leaking resources). Always follow the profiler! — sehe, Nov 27 '14 at 16:15
Its just because I *really* like the idea of only having only one copy of all const strings that I immediately jumped on the `set` :) To stop leaking you could make something like this `using strset = std::unordered_set; strset& GetStrSet() { static strset theOne; return theOne; }` and change `make_atom` `static strset& s_atoms = GetStrSet();` and when you want to stop using it call `GetStrSet().clear();` — Surt, Nov 27 '14 at 17:21
Understood :) anyways I named a few approaches in similar vein. It's up to the OP if you ask me — sehe, Nov 27 '14 at 17:24

Surt · Answer 2 · 2014-11-26T16:44:50.447

You have all the odds against you with this code.

Memory usage.
With 66425 ranges you are way over L1 cache and with the included set of strings you blow L2D also, and might even exceed L3. This means you will often have latency of 50-200 cpu cycles for each data access and your out-of-order execution will only cover ~32 cycles, meaning that the CPU will essentially stall all the time. This is mitigated a lot if all memory is accessed sequentially through hardware prefetchers.
Pointer chasing through map.
Maps and set are typically implemented as a rb_tree with pointers. (interval_map might be different?). To further increase the problem, the access of the pointers will insure that the data is not sequentially accessed, meaning you will get hit by the high latency.
Call of function pointer/virtual function. Surprisingly this doesn't show up in the top 12 unless you use more interval functions inside f. Later when you have solved the other problems you might see that every call to this function will introduce delay of X cycles for every call, where X is the length of the CPU pipeline.

If your using perf to get the performance data, please add the result of a run with perf stat -d. This should show the problems mentioned above with lots of cache misses and idle CPU.

The usage of set<string> is bad because its pointer chasing, you should use a vector<string> instead, you will need to keep it sorted yourself. This should speed up the access in f, but doesn't mitigate the other problems.

Adding an allocator, implementing an arena, to the interval_map might speed the access up as the data should have better localization.

OK, although 66K data ranges doesn't seem to be an excessive amount of data. Maybe there is some other structure that would handle this problem better? — Rich, Nov 26 '14 at 16:31
66K * sizeof(node) * averageSize(set) * averageLength(string) = more than a little, but not a fatal problem before it exceed your L3. 64K*32*32*16=2^16*2^5*2^5*2^4=2^30 or one GB, is that a fair estimate? To solve the problem you would need to reconsider the entire problem, maybe use vector for all structures. — Surt, Nov 26 '14 at 16:39
The size of the input file is 3285763 bytes. (Of course the input file is not the in-memory data structure, but I'd hope that the in-memory data isn't an order of magnitude larger). My machine has 4 MB of L3 cache. However accesses have a *lot* of locality. I managed to make the code far faster just by hand-caching the case where the same lookup is repeated twice. That doesn't sound like a cache problem to me. — Rich, Nov 26 '14 at 16:45
@Rich Good news, my answer has three simple tweaks that combined create a [performance improvement of ~32x](http://stackoverflow.com/a/27160089/85371). — sehe, Nov 26 '14 at 22:39

How to improve performance of boost interval_map lookups

2 Answers2

Volume and data efficiency

Query efficiency

Linked