
I am trying to optimise the Merkle root calculation as much as possible. So far, I have implemented it in Python, which resulted in this question and the suggestion to rewrite it in C++.

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>

#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>



std::vector<unsigned char> double_sha256(std::vector<unsigned char> a, std::vector<unsigned char> b)
{
    unsigned char inp[64];
    int j=0;
    for (int i=0; i<32; i++)
    {
        inp[j] = a[i];
        j++;
    }
    for (int i=0; i<32; i++)
    {
        inp[j] = b[i];
        j++;
    }

    const EVP_MD *md_algo = EVP_sha256();
    unsigned int md_len = EVP_MD_size(md_algo);
    std::vector<unsigned char> out( md_len );
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<std::vector<unsigned char> > calculate_merkle_root(std::vector<std::vector<unsigned char> > inp_list)
{
   std::vector<std::vector<unsigned char> > out;
   int len = inp_list.size();
   if (len == 1)
   {
        out.push_back(inp_list[0]);
        return out;
   }
   for (int i=0; i<len-1; i+=2)
   {
        out.push_back(
            double_sha256(inp_list[i], inp_list[i+1])
        );
   }
   if (len % 2 == 1)
   {
        out.push_back(
            double_sha256(inp_list[len-1], inp_list[len-1])
        );
   }
   return calculate_merkle_root(out);
}



int main()
{
    std::ifstream infile("txids.txt");

    std::vector<std::vector<unsigned char> > txids;
    std::string line;
    int count = 0;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        std::vector<unsigned char> buf2;
        for (int i=31; i>=0; i--)
        {
            buf2.push_back(
                buf[i]
            );
        }
        txids.push_back(
            buf2
        );
        count++;
    }
    infile.close();
    std::cout << count << std::endl;

    std::vector<std::vector<unsigned char> > merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    std::vector<unsigned char> out0 = merkle_root_hash[0];
    std::vector<unsigned char> out;
    for (int i=31; i>=0; i--)
    {
        out.push_back(
            out0[i]
        );
    }

    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');

    return 0;
}

However, the performance is worse than that of the Python implementation (~4s):

$ g++ test.cpp -L/usr/local/opt/openssl/lib -I/usr/local/opt/openssl/include -lcrypto
$ time ./a.out 
1452
289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e

real      0m9.245s
user      0m9.235s
sys       0m0.008s

The complete implementation and the input file are available here: test.cpp and txids.txt.

How can I improve the performance? Are compiler optimizations enabled by default? Are there SHA-256 libraries faster than OpenSSL?

Arty
Andy
    You need to enable compiler optimisations – Alan Birtles May 03 '21 at 06:16
    You should pass your vectors to functions by reference. You should reserve the size of your vectors where you know the size in advance. 1D vectors will likely improve performance over 2D vectors – Alan Birtles May 03 '21 at 06:19
    I just created a C++ and Python library that is `3.5x` faster than the OpenSSL version at solving your task and `7.3x` faster than the `hashlib` version. The library is based on [SIMD](https://en.wikipedia.org/wiki/SIMD) instructions. See my posts below - [Python version](https://stackoverflow.com/a/67450238/941531) and [C++ version](https://stackoverflow.com/a/67450494/941531). Please take a look, I devoted quite a lot of time to it! Upvoted both your questions by the way, they are very good and interesting! – Arty May 08 '21 at 18:34

2 Answers


There are plenty of things you can do to optimize the code.

Here is the list of the important points:

  • compiler optimizations need to be enabled (using -O3 in GCC);
  • std::array can be used instead of the slower dynamically-sized std::vector (since the size of a hash is always 32); one can even define a new Hash type for clarity;
  • parameters should be passed by reference (C++ passes parameters by value by default);
  • the C++ vectors can be reserved to pre-allocate the memory space and avoid unneeded copies;
  • OPENSSL_free must be called to release the memory allocated by OPENSSL_hexstr2buf;
  • push_back should be avoided when the size is a constant known at compile time;
  • using std::copy is often faster (and cleaner) than a manual copy loop;
  • std::reverse is often faster (and cleaner) than a manual loop;
  • the size of a hash is supposed to be 32, but one can check this with assertions to be sure;
  • count is not needed, as it is just the size of the txids vector.

Here is the resulting code:

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <cstring>
#include <array>
#include <algorithm>
#include <cassert>

#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>

using Hash = std::array<unsigned char, 32>;

Hash double_sha256(const Hash& a, const Hash& b)
{
    assert(a.size() == 32 && b.size() == 32);

    unsigned char inp[64];
    std::copy(a.begin(), a.end(), inp);
    std::copy(b.begin(), b.end(), inp+32);

    const EVP_MD *md_algo = EVP_sha256();
    assert(EVP_MD_size(md_algo) == 32);

    unsigned int md_len = 32;
    Hash out;
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<Hash> calculate_merkle_root(const std::vector<Hash>& inp_list)
{
   std::vector<Hash> out;
   int len = inp_list.size();
   out.reserve(len/2+2);
   if (len == 1)
   {
        out.push_back(inp_list[0]);
        return out;
   }
   for (int i=0; i<len-1; i+=2)
   {
        out.push_back(double_sha256(inp_list[i], inp_list[i+1]));
   }
   if (len % 2 == 1)
   {
        out.push_back(double_sha256(inp_list[len-1], inp_list[len-1]));
   }
   return calculate_merkle_root(out);
}

int main()
{
    std::ifstream infile("txids.txt");

    std::vector<Hash> txids;
    std::string line;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        Hash buf2;
        std::copy(buf, buf+32, buf2.begin());
        std::reverse(buf2.begin(), buf2.end());
        txids.push_back(buf2);
        OPENSSL_free(buf);
    }
    infile.close();
    std::cout << txids.size() << std::endl;

    std::vector<Hash> merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    Hash out0 = merkle_root_hash[0];
    Hash out = out0;
    std::reverse(out.begin(), out.end());

    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');

    return 0;
}

On my machine, this code is 3 times faster than the initial version and 2 times faster than the Python implementation.

This implementation spends >98% of its time in EVP_Digest. As a result, if you want faster code, you could try to find a faster hashing library, although OpenSSL should already be pretty fast. The current code already manages to compute 1.7 million hashes per second sequentially on a mainstream CPU, which is quite good. Alternatively, you can also parallelize the program using OpenMP (roughly 5 times faster on my 6-core machine).

Jérôme Richard
    "parameters should be passed by reference". This is a bit _too_ strong. Since C++11 we have move semantics, and vectors are moveable. In this example though, there's the oddity that the same `txids` is hashed 1000 times (for timing purposes I presume). If the optimizer would spot that, you'd really break your timing, but it also prevents moving the vector. – MSalters May 03 '21 at 08:57
  • Thank you for the improvements and explanations! Now the execution time is ~1.7s. – Andy May 03 '21 at 19:49

I decided to implement Merkle root and SHA-256 computation from scratch, with a full SHA-256 implementation, using the SIMD (Single Instruction Multiple Data) approach, available as SSE2, AVX2, and AVX512.

My code below, for the AVX2 case, is 3.5x faster than the OpenSSL version and 7.3x faster than Python's hashlib implementation.

Here I provide the C++ implementation; I also made a Python implementation with the same speed (because it uses the C++ code at its core). For the Python implementation, see the related post. The Python implementation is definitely easier to use than the C++ one.

My code is quite complex, both because it contains a full SHA-256 implementation and because it has a class abstracting all SIMD operations, plus many tests.

First I provide timings, measured on Google Colab because they have a quite advanced AVX2 processor there:

MerkleRoot-Ossl 1274 ms
MerkleRoot-Simd-GEN-1 1613 ms
MerkleRoot-Simd-GEN-2 1795 ms
MerkleRoot-Simd-GEN-4 788 ms
MerkleRoot-Simd-GEN-8 423 ms
MerkleRoot-Simd-SSE2-1 647 ms
MerkleRoot-Simd-SSE2-2 626 ms
MerkleRoot-Simd-SSE2-4 690 ms
MerkleRoot-Simd-AVX2-1 407 ms
MerkleRoot-Simd-AVX2-2 403 ms
MerkleRoot-Simd-AVX2-4 489 ms

Ossl is the OpenSSL implementation for comparison; the rest are my implementations. AVX512 gives an even bigger speedup; it is not tested here because Colab has no AVX512 support. The actual speedup depends on processor capabilities.

Compilation is tested both on Windows (MSVC) and Linux (CLang), using the following commands:

  1. Windows with OpenSSL support: cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1 -DSHS_HAS_OPENSSL=1 /MD -Id:/bin/OpenSSL/include/ /link /LIBPATH:d:/bin/OpenSSL/lib/ libcrypto_static.lib libssl_static.lib Advapi32.lib User32.lib Ws2_32.lib (point the include and library paths at your installed OpenSSL). If OpenSSL support is not needed, use cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1. Instead of AVX2 you may also use SSE2 or AVX512. Windows OpenSSL can be downloaded from here.

  2. Linux CLang compilation is done through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe -DSHS_HAS_OPENSSL=1 -lssl -lcrypto if OpenSSL is needed; if not, then clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe. As you can see, the most recent clang-12 is used; to install it, run bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" (this command is described here). The Linux version automatically detects the current CPU architecture and uses the best available SIMD instruction set.

My code needs C++20 standard support, as it uses some advanced features that make the implementation easier.

I implemented OpenSSL support in my library only to compare timings, to show that my AVX2 version is 3-3.5x faster.

I am also providing timings measured on GodBolt, but those are only an example of AVX-512 usage, as GodBolt CPUs have advanced AVX-512 support. Don't use GodBolt to actually measure timings, because the timings there jump up and down by up to 5x, apparently because of process eviction by the operating system. Here is the GodBolt playground link (this link may have slightly outdated code; use the newest link to the code at the bottom of my post):

MerkleRoot-Ossl 2305 ms
MerkleRoot-Simd-GEN-1 2982 ms
MerkleRoot-Simd-GEN-2 3078 ms
MerkleRoot-Simd-GEN-4 1157 ms
MerkleRoot-Simd-GEN-8 781 ms
MerkleRoot-Simd-GEN-16 349 ms
MerkleRoot-Simd-SSE2-1 387 ms
MerkleRoot-Simd-SSE2-2 769 ms
MerkleRoot-Simd-SSE2-4 940 ms
MerkleRoot-Simd-AVX2-1 251 ms
MerkleRoot-Simd-AVX2-2 253 ms
MerkleRoot-Simd-AVX2-4 777 ms
MerkleRoot-Simd-AVX512-1 257 ms
MerkleRoot-Simd-AVX512-2 741 ms
MerkleRoot-Simd-AVX512-4 961 ms

Examples of my code's usage can be seen inside the Test() function, which tests all the functionality of my library. My code is a bit dirty, because I didn't want to spend very much time creating a beautiful library, but rather to make a proof of concept showing that a SIMD-based implementation can be considerably faster than the OpenSSL version.

If you really want to use my boosted SIMD-based version instead of OpenSSL, and you care about speed very much and have questions about how to use it, please ask me in the comments or in chat.

Also, I didn't bother implementing a multi-core/multi-threaded version; I think it is obvious how to do that, and you can and should implement it without difficulty.

I am providing an external link to the code below, because my code is around 51 KB in size, which exceeds the 30 KB of text allowed in a StackOverflow post.

sha256_simd.cpp

Arty
  • Running this code today gives me ```MerkleRoot-Ossl 481 ms MerkleRoot-Simd-GEN-1 1243 ms MerkleRoot-Simd-GEN-2 1292 ms MerkleRoot-Simd-GEN-4 612 ms MerkleRoot-Simd-GEN-8 315 ms MerkleRoot-Simd-GEN-16 401 ms MerkleRoot-Simd-SSE2-1 561 ms MerkleRoot-Simd-SSE2-2 550 ms MerkleRoot-Simd-SSE2-4 694 ms MerkleRoot-Simd-AVX2-1 327 ms MerkleRoot-Simd-AVX2-2 331 ms MerkleRoot-Simd-AVX2-4 437 ms ``` So it's not clearly better than OpenSSL 1.1.1.k, this is on Arch linux on AMD Ryzen 5 3600. – Reimundo Heluani Jul 14 '21 at 15:31
    @ReimundoHeluani This post was made around 1 month ago, so all measurements are quite recent. But of course they depend on the CPU. The place where I measured gave a `3.5x` improvement over the OpenSSL version. Other CPUs may not give as much improvement. The main reason I made this code is that the OP wanted the fastest code possible, so he probably has quite good CPUs available, maybe even access to a DataCenter. So my code will give quite good improvements in that case of very modern CPUs. Also, in your case you got `315ms` at best, which is `1.52x` faster than OpenSSL's `481ms`. – Arty Jul 14 '21 at 15:39
    @ReimundoHeluani Also, if you're doing many-day calculations, like crypto mining, then even a `1.5x` improvement will save you a lot of time, money and electricity. – Arty Jul 14 '21 at 15:40
    Indeed it seems your solution is really the fastest and single header I tested on this Ryzen. Any chance you can modify the merkle computation to include the situation where there's a limit added: you pass a vector of size 8, but want to consider a tree with 32 leaves, the last 24 are padded by zero? I figured it may be worth asking you before me trying to follow your code and adding this – Reimundo Heluani Jul 14 '21 at 18:21
  • @ReimundoHeluani If the last 24 are padded with zeros, then you can just pass 32 entries with 24 zeros at the end, pass them yourself. The only reason I see to make such a modification in my code is if it gives some speed improvement. If this change gives no speed improvement, then there is no actual point in implementing such corner cases on my side. Just pass all the input data yourself the way you like, padded with zeros or with any other modifications. BTW, is it a common case for a Merkle root that users pass zero values at the end? – Arty Jul 15 '21 at 03:06
  • @ReimundoHeluani Following my last comment above, I can say that a speed improvement can actually be achieved, due to the fact that the trailing zero values all have the same hash. But I think it is an even more general case - not only zero values have equal hashes, but any equal values do. But I don't know the details of Merkle roots: does it often happen that people pass a lot of equal values (including equal zero values/padding)? Is it a special trait of Merkle roots that many equal values often appear? Is it really worth doing such an optimization for everybody? – Arty Jul 15 '21 at 03:12
  • The typical situation I have is that my vector is of size 150K and the limit is of size 2^40, so one needs to cache a table of zero hashes (0, hash(0, 0), hash(hash(0,0),hash(0,0)),...) and only compute the effective part of the tree. – Reimundo Heluani Jul 15 '21 at 09:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234895/discussion-between-reimundo-heluani-and-arty). – Reimundo Heluani Jul 15 '21 at 09:11