4

I have a 2 dimensional array with fixed number of columns=4, but the number of rows is very large. I found out using vector<tuple> instead of vector<array> or was nearly 2x faster, and 5x faster than vector<vector>. Though using a tuple is quite a headache. I defined a access function to get around it:

typedef tuple<int,int,int,int> Tuple;
// access the i-th element
inline int &tget (Tuple &t, int i) {
    switch (i) {
        case 0: return std::get<0>(t); 
        case 1: return std::get<1>(t); 
        case 2: return std::get<2>(t); 
        case 3: return std::get<3>(t); 
    }   
}
vector<Tuple> array; 
// this gives the element in i-th row and j-th column
tget(array[i],j); 

I found this quite surprising and was wondering where this performance boost coming from, and if there is a neat solution with a syntax similar to vector<vector> or vector<array> using subscript operator[] with the same performance as tuple?

UPDATE: I'm using gcc4.8 with c++11 standards and I use -O3 for optimization. Because people asked about the operations The operations are quite basic. First the 2d array is filled up with up to 1M rows. First I reserve the space and then use push_back function to add elements. Then I sort the array, and finally I do a binary search on it.

This is the code block I wrote. The input will be several test cases (=T) and each test case takes an integer (N up to 20), and then reads N more integers in lens. Then because of the exponential nature of the function the sums get quite big (4^(N/2) up to roughly a million). To test between array and tuple, toggle the typedef and comment/uncomment the return line in the access function. There are four lines/blocks that have to be commented/uncommented to switch back and forth between array and tuple, and the lines are designated in the comments (// if array // if tuple comments) :

#include <iostream>
#include <algorithm>
#include <vector>
#include <tuple>
#include <array>
using namespace std;

typedef vector<int>::iterator VecIt;

// HERE you can switch between the two types 
typedef tuple<int,int,int,int> Tuple;       // define it as a tuple
//typedef array<int,4> Tuple;               // define it as array
inline int &tget (Tuple &t, int i) {
  //  return t[i];                          // if it's an array, uncomment This too

  // if it's an array comment the switch case                                                                                                                                                                                                                                                                                                                                                                                                                                               
    switch (i) {
        case 0: return std::get<0>(t);
        case 1: return std::get<1>(t);
        case 2: return std::get<2>(t);
        case 3: return std::get<3>(t);
        default:
          cerr << "tuple index out of bound" << endl;
    }
}

void comp_sums(VecIt v, VecIt vend, vector<Tuple> &sums){
    if (v==vend)
        return;
    int size = sums.size();
    for (int i=0; i<size; i++) {
        for (int j=1; j<4; j++){
            sums.push_back(sums[i]);
            tget(sums.back(),j) += *v;
        }
        tget(sums[i],0) += *v;
    }
    v++;
    comp_sums(v, vend, sums);
}

void docase( ) {
    int N;
    cin >> N;
    vector<int> lens(N);
    int lsum = 0;
    for (int i=0; i<N; i++) {
        cin >> lens[i];
        lsum += lens[i];
    }
    if (lsum % 4 != 0 ) {
        cout << 0 << endl;
        return;
    }
    lsum = lsum/4;

    vector<Tuple> sums1, sums2;
    Tuple z(0,0,0,0);                             // if it's a tuple
    //Tuple z = {0,0,0,0};                        // If it's an array 
    sums1.push_back(z); sums1.reserve(1<<N/2);
    sums2.push_back(z); sums2.reserve(1<<N/2);
    comp_sums(lens.begin()+1, lens.begin()+N/2, sums1);
    comp_sums(lens.begin()+N/2+1, lens.end(), sums2);
    sort(sums1.begin(), sums1.end());
    sort(sums2.begin(), sums2.end());
    long count = 0;
    for (int i=0; i<sums1.size(); i++) {
        for (int k=0; k<4; k++) {
            //Tuple t({lsum,lsum,lsum,lsum});     // if it's an array
            Tuple t(lsum,lsum,lsum,lsum);         // if it's a tuple
            for (int j=0; j<4; j++)
                tget(t,j) -= tget(sums1[i],(j+k)%4);
            tget(t,k) -= lens[0];
            tget(t,0) -= lens[N/2];
            bool neg = false;
            for (int j=0; j<4; j++)
                if (tget(t,j)<0){
                    neg = true;
                    break;
                }
            if (neg)
                continue;
            auto LB = lower_bound(sums2.begin(), sums2.end(), t);
            auto UB = upper_bound(LB, sums2.end(), t);
            count += (UB - LB);
        }
    }
    cout << count/6 << endl;
}


int main() {
    std::ios_base::sync_with_stdio(false);
    int T;
    cin >> T;
    while (T--) docase();
}

Here is also a test input I use. The first line says there are T=18 separate test cases. First line for each test case is N (up to 20), and then N integers follow. Here it is. The algorithm has exponential growth so the small numbers don't mean it's only a few operations. Also in measuring time I have accounted for the I/O:

18
8
132391 123854 21536 19482 133025 10945 121800 10311
8                                                                                                                                                             
12 4 12 4 4 12 12 24
8
131723 16253 23309 132227 125171 12253 16757 136227
8
8 4 4 8 4 8 12 24
8
12 12 12 8 4 8 12 28
8
115021 114654 112093 17443 20371 17810 17274 115190
8
8 8 4 4 12 4 12 20
8
12 4 4 4 4 12 12 28
8
8 12 8 12 8 4 12 28
8
4 12 8 12 8 8 4 24
20
34836 56869 24112 27796 198091 196472 50364 27364 29572 53701 52981 25779 202644 204884 53840 30056 48705 53092 43751 18419
20
207447 26867 41968 35977 36384 57042 40981 30191 47705 26907 248361 26003 53715 48237 27691 192772 60507 58194 20754 245913
20
34836 56869 24112 27796 198091 196472 50364 27364 29572 53701 52981 25779 202644 204884 53840 30056 48705 53092 43751 18419
20
207447 26867 41968 35977 36384 57042 40981 30191 47705 26907 248361 26003 53715 48237 27691 192772 60507 58194 20754 245913
20
4 2 2 4 4 2 4 2 4 2 2 4 4 3 3 4 2 3 4 1
20
2 3 2 2 3 2 3 4 3 2 4 4 4 3 2 3 4 4 3 3
20
4 4 4 4 3 3 3 2 2 4 2 4 2 2 4 2 2 2 2 1
20
4 4 2 3 4 3 4 3 2 4 4 2 2 4 3 4 3 3 2 4
Ameer Jewdaki
  • 1,758
  • 4
  • 21
  • 36
  • 12
    Can you show the benchmark that shows the performance boost including compiler and compiler flags? – nwp Jan 25 '18 at 14:32
  • 3
    I am highly surprised by your findings. Have you enabled optimizations? – Quentin Jan 25 '18 at 14:32
  • 2
    "vector instead of vector or was nearly 2x faster" I would like to see test which shows that. I bet there is something else involved. – Slava Jan 25 '18 at 14:33
  • While `vector` being slower sounds plausible, `array` and `tuple` should have the same performance... – HolyBlackCat Jan 25 '18 at 14:33
  • 1
    @HolyBlackCat actually tuple should be slower as switch involved for access. – Slava Jan 25 '18 at 14:35
  • 11
    If you are not doing a benchmark then *how* did you determine that it was twice as fast? You must have measured somehow – UnholySheep Jan 25 '18 at 14:38
  • 5
    [https://rationalwiki.org/wiki/Extraordinary_claims_require_extraordinary_evidence](https://rationalwiki.org/wiki/Extraordinary_claims_require_extraordinary_evidence), something like "brief explanation" is not enough, you definitely need to supply [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). – user7860670 Jan 25 '18 at 14:38
  • @kebs: From the OP's update " *First I reserve the space. * .." – Martin Bonner supports Monica Jan 25 '18 at 14:44
  • "...was nearly 2x faster...": what operation was faster, precisely. Do you count filling and sorting ? – kebs Jan 25 '18 at 14:44
  • 1
    @kebs: That said, a size four `array` and size four `tuple` should both involve the same amount of copying, so that shouldn't affect anything. – ShadowRanger Jan 25 '18 at 14:45
  • 6
    Basically, we don't believe you. Equal size `tuple` and `array` should be the same speed, so we think there is something else which is different. – Martin Bonner supports Monica Jan 25 '18 at 14:46
  • I'm surprised myself. I've added my entire code now. It's quite a messy algorithm so I don't know if it's helpful to see, but at least you can see the operations I'm doing on the 2d array. – Ameer Jewdaki Jan 25 '18 at 14:56
  • You should perform benchmarks without doing any input/output. They're incredibly slow operations and are unrelated to the performance of a `vector` vs `vector`. – Cornstalks Jan 25 '18 at 14:58
  • @Slava I doubt that the switch even gets compiled as is. It should be optimized to pointer arithmetic which is the equivalent of using `operator[]` on `std::array`. – patatahooligan Jan 25 '18 at 14:59
  • 1
    it's very small, up to 100 integers read in total are read. and besides I'm keeping the input/output fixed – Ameer Jewdaki Jan 25 '18 at 14:59
  • 1
    measuring the performance metrics of an algorithm/program **is benchmarking**. So please stop saying you are not doing benchmarking. You may not use specialized tools for it, but you are doing benchmarking. Doing it right or not is another matter. – bolov Jan 25 '18 at 15:00
  • The input files are fixed and I have accounted for the IO, still there is some considerable speedup by using tuples. I don't know if it's helpful to put the input files as well or not – Ameer Jewdaki Jan 25 '18 at 15:07
  • So you are claiming that `std::tuple()` is twice faster than `std::array` based on 100 integers test which involves IO? Huh. – Slava Jan 25 '18 at 15:19
  • If you follow the algorithm you realize the `sums1` and `sums2` get 4^10 big, 20 integers doesn't mean the execution is 20 steps, and as I said, I accounted for IO when I said 2x faster. I have put the input file also in the updated post. – Ameer Jewdaki Jan 25 '18 at 15:24
  • as I said the algorithm is a bit complicated to understand. But you don't have to understand it, you can just run it and check. – Ameer Jewdaki Jan 25 '18 at 15:27
  • 1
    Here is a link to a "benchmarkable" version: https://wandbox.org/permlink/1YU0rgPe1NvWOii7 On wandbox, arrays are faster than tuple, but on machine (windows, mingw), tuple are faster (but not twice as fast, about 50% faster). – Holt Jan 25 '18 at 17:24

2 Answers2

2

My guess is that the bottleneck is operator< of std::tuple vs. std::array. If you try with this user defined structure:

struct my_struct {
    int a, b, c, d;
};

inline bool operator<(my_struct const& lhs, my_struct const& rhs) {
    return std::tie(lhs.a, lhs.b, lhs.c, lhs.d) <
        std::tie(rhs.a, rhs.b, rhs.c, rhs.d);
}

This is about 5 times slower than std::tuple or std::array (tested on both my computer with MinGW and Wandbox).

Furthermore, the operator< assembly for std::tuple and std::array are different (with -O3), which may account for the performance difference between std::tuple and std::array.

fun(std::array<int, 4ul> const&, std::array<int, 4ul> const&):
  mov ecx, DWORD PTR [rdi]
  cmp DWORD PTR [rsi], ecx
  lea rdx, [rsi+16]
  lea rax, [rdi+16]
  jg .L32
  jl .L31
.L25:
  add rdi, 4
  add rsi, 4
  cmp rax, rdi
  je .L36
  mov ecx, DWORD PTR [rsi]
  cmp DWORD PTR [rdi], ecx
  jl .L32
  jle .L25
.L31:
  xor eax, eax
  ret
.L32:
  mov eax, 1
  ret
.L36:
  cmp rdx, rsi
  setne al
  ret

fun(std::tuple<int, int, int, int> const&, std::tuple<int, int, int, int> const&):
  mov edx, DWORD PTR [rsi+12]
  cmp DWORD PTR [rdi+12], edx
  mov eax, 1
  jl .L11
  mov eax, 0
  jg .L11
  mov ecx, DWORD PTR [rsi+8]
  cmp DWORD PTR [rdi+8], ecx
  mov eax, 1
  jl .L11
  mov eax, 0
  jg .L11
  mov ecx, DWORD PTR [rsi+4]
  cmp DWORD PTR [rdi+4], ecx
  mov eax, 1
  jl .L11
  mov eax, 0
  jg .L11
  mov eax, DWORD PTR [rsi]
  cmp DWORD PTR [rdi], eax
  setl al
.L11:
  rep ret
Holt
  • 36,600
  • 7
  • 92
  • 139
  • In the case of the tuple, MORE operations/comparisons are happening at the same time. Like I said in my answer:the operations on a struct are better optimized, than on an array. – Robert Andrzejuk Jan 25 '18 at 19:00
  • @RobertAndrzejuk You're right, all `operator<` are inlined, I don't know why it was not the last time I checked. *"operations on a struct are better optimized"* - This does not seem true because 1) on Wandbox, arrays perform better than tuple and 2), my custom structure should perform like tuple (the generated assembly for `operator<` are identical), or at least better than arrays, if this was the case. – Holt Jan 25 '18 at 19:13
1

Another question is similar to your question : Is it fine to replace an array with hierarchal structures and classes?

It has code and benchmark.

What is basically happening is that the tuple creates a structure which the compiler is able to better optimize operations on, than on an array of elements.

Robert Andrzejuk
  • 5,076
  • 2
  • 22
  • 31