1

I have N points in D dimensions, where let's say N is 1 million and D is 100. All my points have binary coordinates, i.e. they lie in {0, 1}^D, and I am only interested in speed.

Currently my implementation uses std::vector<int>. I am wondering whether I could get faster execution by changing my data structure. I am only doing insertions and searches (I don't change the bits).

All the related questions I found mention std::vector<char>, std::vector<bool> and std::bitset, but they focus on the space benefits one gets from such structures.

What's the appropriate data structure, when speed is of main concern, for binary data in C++?


I intend to populate my data structure with the binary data and then do a lot of contiguous searches (I mean that I don't care about the i-th coordinate of a point in isolation; when I access a point I will access all of its coordinates consecutively). I will compute the Hamming distance between pairs of points.

gsamaras
  • @ks1322, it's so hard for me to see what you edited there, but I can see that you didn't upvote - does this mean that the question sucks and should be deleted? – gsamaras Nov 23 '16 at 20:29
  • No question sucks. Some of them need a little coaching, but every question has potential. – nicomp Nov 23 '16 at 20:35
  • @nicomp that's a bit optimistic, but I see your point. However, you, as well as ks1322, saw the question and did not upvote. I am starting to worry; do you have any suggestion for improvement? – gsamaras Nov 23 '16 at 20:36
  • I've just fixed broken formatting, you can easily see it by clicking on `side-by-side markdown`. – ks1322 Nov 23 '16 at 20:38
  • @gsamaras I'd like to see some code so I know the OP put in some effort. I don't downvote questions as a general rule because too many of my questions have been undeservedly downvoted, IMHO. – nicomp Nov 23 '16 at 20:48
  • @nicomp I can't post my entire project, it's big and I am not sure if creating a minimal example would help..I mean it's a general question. :) ks1322 thanks for the tip! – gsamaras Nov 23 '16 at 20:50
  • @gsamaras One important aspect of debugging a problem is being able to create a minimal example that reproduces the problem. Often doing that will reveal the issue before it even gets to SO. – nicomp Nov 23 '16 at 20:54
  • Yes I agree @nicomp, but I don't have a debugging problem to solve here. :) – gsamaras Nov 23 '16 at 20:55
  • @gsamaras That's the reason people want to zap you. I'm trying to help you not get downvoted. – nicomp Nov 23 '16 at 20:55
  • @nicomp thanks for the comment and trying to help, I am just explaining you why I don't provide a min. e.g. here. In general, a minimal example is a must. ;) – gsamaras Nov 23 '16 at 20:57
  • @gsamaras Well, you're doing very well so far 'cause you have no downvotes and you haven't been deleted. On this site that's a victory these days. – nicomp Nov 23 '16 at 21:02
  • @nicomp it's too soft. A huge amount of questions that should be closed survive, the rate of nice questions is dropping fast, but that's a broader issue, let's focus on the bits! :D – gsamaras Nov 23 '16 at 21:04
  • @gsamaras that really depends on what you are trying to do and what you are trying to optimize, especially with such big numbers. There is no one good answer. Are you inserting once and then search a lot? or are you searching and inserting all the time? what kind of searches are you running? Is most of the data initialized or does it remain empty? Do you care about average or worst case performance? – user1708860 Nov 23 '16 at 21:12
  • Cool questions, I will update @user1708860, but what do you mean empty? To have all bits 0? I care about the time that my program will run. – gsamaras Nov 23 '16 at 21:14
  • @gsamaras it was just an example. I meant that you have to give us more data about your use case. There are way too many ways in which "inserting and searching" could be done. – user1708860 Nov 23 '16 at 21:20
  • @gsamaras generally speaking - you want to write it once, unit tests included and then measure it. If after measuring - you think it's slow - then you optimize. In most cases you would not have to optimize. – user1708860 Nov 23 '16 at 21:21
  • I would recommend to test your implementation. Play around with different types for your binary value. Is your N fixed or has an upper bound? If so, try using an array. Perhaps your performance will mostly be a cache issue, i.e. you might end up optimizing your access pattern if you have any control over that. – Stan Nov 23 '16 at 22:48
  • @ks1322 thanks. I posted an answer guys, hope you like it, thanks for the inputs! :) – gsamaras Nov 24 '16 at 19:02

3 Answers

3

If the values are independently, uniformly distributed, and you want to find the Hamming distance between two independently, randomly chosen points, the most efficient layout is a packed array of bits.

This packed array would ideally be chunked into the largest block size over which your popcnt instruction works: 64 bits. The Hamming distance is then the sum of popcnt(x_blocks[i] ^ y_blocks[i]). On processors with efficient unaligned accesses, byte alignment with unaligned reads is likely to be most efficient. On processors where unaligned reads incur a penalty, one should consider whether the memory overhead of aligned rows is worth the faster logic.
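A minimal sketch of this layout, assuming D = 100 packed into two 64-bit blocks per point and the GCC/Clang __builtin_popcountll intrinsic (C++20's std::popcount would be the portable equivalent); the names here are illustrative, not part of the answer:

// Pack each point into 64-bit blocks, then XOR + popcount block by block.
#include <cstdint>
#include <vector>

const int D = 100;
const int BLOCKS = (D + 63) / 64;            // 2 blocks of 64 bits per point

// all points stored row-major: point i occupies blocks [i*BLOCKS, (i+1)*BLOCKS)
typedef std::vector<std::uint64_t> PackedPoints;

inline unsigned hamming(const std::uint64_t* x, const std::uint64_t* y)
{
    unsigned dist = 0;
    for (int b = 0; b < BLOCKS; ++b)
        dist += __builtin_popcountll(x[b] ^ y[b]);   // one XOR + popcnt per 64 coordinates
    return dist;
}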

Veedrac
2

Locality of reference will likely be the driving force. So it's fairly obvious that you represent the D coordinates of a single point as a contiguous bitvector. std::bitset<D> would be a logical choice.

However, the next important thing to realize is that you see locality benefits easily up to 4KB. This means that you should not pick a single point and compare it against all other N-1 points. Instead, group points in sets of 4KB each, and compare those groups. Both ways are O(N*N), but the second will be much faster.
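A rough sketch of that blocked traversal, assuming the points are stored as std::bitset<D> rows and the per-pair distance is handed to a caller-supplied visitor (both are assumptions of this example, not requirements):

// Tile the N x N comparison so each pair of ~4 KB groups stays cache-resident.
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <vector>

const int D = 100;
const std::size_t BLOCK = 4096 / sizeof(std::bitset<D>);   // points per ~4 KB tile

template <typename Visit>
void all_pairs(const std::vector<std::bitset<D> >& pts, Visit visit)
{
    const std::size_t n = pts.size();
    for (std::size_t ib = 0; ib < n; ib += BLOCK)
        for (std::size_t jb = 0; jb < n; jb += BLOCK)
            // both tiles fit in cache across the two inner loops
            for (std::size_t i = ib; i < std::min(ib + BLOCK, n); ++i)
                for (std::size_t j = jb; j < std::min(jb + BLOCK, n); ++j)
                    visit(i, j, (pts[i] ^ pts[j]).count());   // Hamming distance
}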

You may be able to beat O(N*N) by using the triangle inequality - Hamming(a,b) + Hamming(b,c) >= Hamming(a,c). I'm just not sure how. It probably depends on what output you want. The naive output would be an N*N set of distances, and that's unavoidably O(N*N).

MSalters
0

I wrote a simple program to populate and contiguously access a data structure with binary data:

  1. std::vector<int>
  2. std::vector<char>
  3. std::vector<bool>
  4. std::bitset

For timing I used my Time measurements approach. I compiled with the -O3 optimization flag, with N = 1 million and D = 100.

This is the code for vectors:

#include <vector>
#include <iostream>
#include <random>
#include <cmath>
#include <numeric>
#include <functional> //plus, equal_to, not2

#include <ctime>
#include <ratio>
#include <chrono>

#define T int

// Hamming distance: counts the coordinate positions where the two points differ
unsigned int hd(const std::vector<T>& s1, const std::vector<T>::iterator s2)
{
    return std::inner_product(
        s1.begin(), s1.end(), s2,
        0, std::plus<unsigned int>(),
        std::not2(std::equal_to<std::vector<T>::value_type>())
    );
}


std::uniform_int_distribution<int> uni_bit_distribution(0, 1);
std::default_random_engine generator(std::chrono::system_clock::now().time_since_epoch().count());

// g++ -Wall -O3 bitint.cpp -o bitint
int main()
{
    const int N = 1000000;
    const int D = 100;
    unsigned int hamming_dist[N] = {0};  // note: these two 4 MB arrays live on the stack and may exceed its default size
    unsigned int ham_d[N] = {0};

    // build a random query point q with D binary coordinates
    std::vector<T> q;
    for(int i = 0; i < D; ++i)
        q.push_back(uni_bit_distribution(generator));

    using namespace std::chrono;
    high_resolution_clock::time_point t1 = high_resolution_clock::now();


    // build the dataset: point i occupies the contiguous range [i*D, (i+1)*D)
    std::vector<T> v;
    v.resize(N * D);
    for(int i = 0; i < N; ++i)
        for(int j = 0; j < D; ++j)
            v[j + i * D] = uni_bit_distribution(generator);


    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double> >(t2 - t1);

    std::cout << "Build " << time_span.count() << " seconds.\n";

    t1 = high_resolution_clock::now();

    for(int i = 0; i < N; ++i)
        for(int j = 0; j < D; ++j)
            hamming_dist[i] += (v[j + i * D] != q[j]);

    t2 = high_resolution_clock::now();
    time_span = duration_cast<duration<double> >(t2 - t1);
    std::cout << "No function hamming distance " << time_span.count() << " seconds.\n";

    t1 = high_resolution_clock::now();

    for(int i = 0; i < N; ++i)
        ham_d[i] = hd(q, v.begin() + (i * D));

    t2 = high_resolution_clock::now();
    time_span = duration_cast<duration<double> >(t2 - t1);
    std::cout << "Yes function hamming distance " << time_span.count() << " seconds.\n";

    return 0;
}

The code for std::bitset can be found in: XOR bitset when 2D bitset is stored as 1D

For std::vector<int> I got:

Build 3.80404 seconds.
No function hamming distance 0.0322335 seconds.
Yes function hamming distance 0.0352869 seconds.

For std::vector<char> I got:

Build 8.2e-07 seconds.
No function hamming distance 8.4e-08 seconds.
Yes function hamming distance 2.01e-07 seconds.

For std::vector<bool> I got:

Build 4.34496 seconds.
No function hamming distance 0.162005 seconds.
Yes function hamming distance 0.258315 seconds.

For std::bitset I got:

Build 4.28947 seconds.
Hamming distance 0.00385685 seconds.

std::vector<char> seems to be the winner.

gsamaras
  • Those timings are meaningless; a decent optimizer can (and does) avoid running the loop altogether. – Veedrac Nov 24 '16 at 20:41
  • Yes, I know. The problem is with the code, not the way you called `g++`. – Veedrac Nov 24 '16 at 20:57
  • @Veedrac I am not sure what you want me to change! :/ – gsamaras Nov 24 '16 at 21:11
  • You should prevent the optimizer from removing the code. Normally a quick way to do this is by making sure inputs that are run-time variables in the real program aren't constant foldable in the code you write (so the compiler can't do the work at compile-time), and that all outputs that will get used in the real program *can* get used in the output program (so the compiler can't just skip the work). – Veedrac Nov 24 '16 at 22:13
  • @Veedrac I am really sorry but I don't see a trivial way of doing this (maybe I am too tired, since I am racing for a deadline that is approaching fast). – gsamaras Nov 24 '16 at 22:41
  • A super simple way if you're rushing is to take arguments from `argv`, including an element of the result to print. – Veedrac Nov 25 '16 at 00:32
  • @Veedrac oh so you were just talking about `N` and `D`? – gsamaras Nov 25 '16 at 20:25
  • also `v`, `hamming_dist` and `ham_d`. – Veedrac Nov 25 '16 at 20:29
  • I don't get it @Veedrac. `v` is constructed randomly, what should I pass in the *cmd*? Also the arrays to store the distances, I don't get it either! – gsamaras Nov 25 '16 at 20:37
  • `v` is fine since the random data is seeded from the time. But the compiler can just "forget" to write results to `hamming_dist` and `ham_d`, since those are never used. This is evidently what happened with your `std::vector` results, since that's faster than physically feasible. If you write out an arbitrary element, the compiler at least has to do the work to calculate it! [A sketch of this fix follows this thread.] – Veedrac Nov 25 '16 at 21:02
  • I see @Veedrac, but how to pass the arrays for cmd? :/ – gsamaras Nov 25 '16 at 21:06
  • The difference is way too big for the result to be valid. – harold Nov 26 '16 at 13:37
  • @harold that's the timings I got... :) – gsamaras Nov 26 '16 at 15:09
  • @gsamaras yes and they're obviously wrong. The only way there's going to be a 7 orders of magnitude difference is by the compiler being tricky and deleting the benchmarked code. Consider that 8E-7 second is about 3000 clock cycles. You can't do 1000000 of *anything* in that time. – harold Nov 26 '16 at 17:19
  • @gsamaras you should try printing the result and see if you can reproduce the measurements – user1708860 Nov 26 '16 at 20:57
  • That makes sense.. :/ – gsamaras Nov 26 '16 at 23:57
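A minimal sketch of the fix Veedrac describes above, assuming N, D and the index of the element to print are read from argv so the compiler can neither constant-fold the sizes nor drop the result stores; the argument handling is my own illustration, not code from the thread:

// Make the benchmark resistant to dead-code elimination.
#include <cstdlib>
#include <iostream>
#include <vector>

int main(int argc, char* argv[])
{
    if (argc < 4) return 1;
    const int N = std::atoi(argv[1]);   // run-time sizes: the loops cannot be folded at compile time
    const int D = std::atoi(argv[2]);
    (void)D;

    std::vector<unsigned int> hamming_dist(N, 0);
    // ... populate v and q, time the distance loops exactly as in the answer ...

    // print one run-time-chosen element so the stores into hamming_dist cannot be dropped
    std::cout << hamming_dist[std::atoi(argv[3]) % N] << '\n';
    return 0;
}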