
My implementation in Python calculates the merkle root hash for ~1500 input hashes:

import numpy as np
from binascii import unhexlify, hexlify
from hashlib import sha256

txids = np.loadtxt("txids.txt", dtype=str)

def double_sha256(a, b):
    inp = unhexlify(a)[::-1] + unhexlify(b)[::-1]
    sha1 = sha256(inp).digest()
    sha2 = sha256(sha1).digest()
    return hexlify(sha2[::-1])


def calculate_merkle_root(inp_list):
    if len(inp_list) == 1:
        return inp_list[0]
    out_list = []
    for i in range(0, len(inp_list)-1, 2):
        out_list.append(double_sha256(inp_list[i], inp_list[i+1]))
    if len(inp_list) % 2 == 1:
        out_list.append(double_sha256(inp_list[-1], inp_list[-1]))
    return calculate_merkle_root(out_list)

for i in range(1000):
    merkle_root_hash = calculate_merkle_root(txids)

print(merkle_root_hash)

The merkle root is calculated 1000 times, so a single calculation takes ~5 ms:

$ time python3 test.py 
b'289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e'

real    0m5.132s
user    0m5.501s
sys     0m0.133s

How could I improve the speed of the calculation? Can this code be optimized?

So far, I have tried unrolling the recursive function in both Python and C++. However, performance did not improve; it took ~6 ms.
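For reference, an iterative (unrolled) version of the recursive function can be sketched in pure Python; this assumes the raw-byte variant of double_sha256 (hashing a + b directly, without hex conversion):

```python
from hashlib import sha256

def double_sha256(a, b):
    # Two rounds of SHA-256 over the concatenation of two raw-byte hashes.
    return sha256(sha256(a + b).digest()).digest()

def calculate_merkle_root_iter(hashes):
    # Reduce the list level by level until a single hash remains.
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd counts
        level = [double_sha256(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

In CPython this mainly removes the recursion overhead; the bulk of the time is still spent inside sha256, so the gain is expectedly small.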

EDIT

The file is available here: txids.txt

EDIT 2

Following a suggestion in the comments, I removed the unnecessary unhexlify and hexlify steps from the inner loop. The list is now prepared once before the loop.

def double_sha256(a, b):
    inp = a + b
    sha1 = sha256(inp).digest()
    sha2 = sha256(sha1).digest()
    return sha2

def map_func(t):
    return unhexlify(t)[::-1]
txids = list(map(map_func, txids))

for i in range(1000):
    merkle_root_hash = calculate_merkle_root(txids)
    merkle_root_hash = hexlify(merkle_root_hash[::-1])

Now the execution takes ~4 ms:

$ time python3 test2.py 
b'289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e'

real    0m3.697s
user    0m4.069s
sys     0m0.128s
Andy
  • well are you going to include txids.txt or what? – fishstix44 May 02 '21 at 12:14
  • @fishstix44 I uploaded the requested file in the edit – Andy May 02 '21 at 13:01
  • 1
    try unhexing once, and hexing the final output rather than unhexing, to hex the SHAs, to then unhex them once more. Also, i'm not sure why you are using SHA256 twice – Loveen Dyall May 02 '21 at 13:28
  • >50% of the script's time is spent computing sha256 hashes. Why do you do `merkle_root_hash = calculate_merkle_root(txids)` in a loop? Should the real loop compute independent work? – Jérôme Richard May 02 '21 at 13:31
  • @LoveenDyall thank you for the suggestion! I applied it in the second edit. – Andy May 02 '21 at 14:35
  • @JérômeRichard yes, in the final program the list `txids` is updated each time. Additionally, the loop makes the performance measurement more accurate. – Andy May 02 '21 at 14:37
  • 1
    I just created C++ and Python library that is `3.5x` times faster than OpenSSL version in solving your task and `7.3x` times faster than `hashlib` version. Library based on [SIMD](https://en.wikipedia.org/wiki/SIMD) instructions. See my posts below - [Python version](https://stackoverflow.com/a/67450238/941531) and [C++ version](https://stackoverflow.com/a/67450494/941531). Please put a look, I devoted quite a lot of time! UpVoted your both questions by the way, they are very good and interesting! – Arty May 08 '21 at 18:33

2 Answers


I decided to implement SHA-256 fully from scratch, using SIMD instruction sets (read about them here: SSE2, AVX2, AVX512).

As a result, for the AVX2 case my code below is 3.5x faster than the OpenSSL version and 7.3x faster than Python's hashlib implementation.

I also created a related second post covering the C++ version; see it here. Read the C++ post for more details about the library; this Python post is more high-level.

First, the timings:

simple 3.006
openssl 1.426
simd gen 1 1.639
simd gen 2 1.903
simd gen 4 0.847
simd gen 8 0.457
simd sse2 1 0.729
simd sse2 2 0.703
simd sse2 4 0.718
simd sse2 8 0.776
simd avx2 1 0.461
simd avx2 2 0.41
simd avx2 4 0.549
simd avx2 8 0.521

Here simple is the hashlib version, close to the one you provided; openssl is the OpenSSL version; the remaining simd entries are my SIMD (SSE2/AVX2/AVX512) implementations. As you can see, the AVX2 version is 3.5x faster than the OpenSSL version and 7.3x faster than native Python hashlib.

The timings above were taken in Google Colab, which provides fairly modern AVX2 CPUs.

The library code is provided at the bottom as separate links, because it is very large and doesn't fit into StackOverflow's 30 KB limit. There are two files, sha256_simd.py and sha256_simd.hpp. The Python file contains timings and usage examples, plus a Cython-based wrapper for my C++ library shipped in the .hpp file. The Python file contains everything needed to compile and run the code: just place both files next to each other and run the Python file.

I tested this program/library on both Windows (MSVC compiler) and Linux (Clang compiler).

Usage examples for my library are located in the merkle_root_simd_example() and main() functions. Basically, you do the following things:

  1. First, import my library through mod = sha256_simd_import(cap = 'avx2'). Do this only once per program run, not multiple times; remember the returned module in some global variable. In the cap parameter you should put whatever your CPU supports: gen, sse2, avx2, or avx512, in order of increasing technology complexity and speed. gen is generic non-SIMD operations, sse2 uses 128-bit operations, avx2 256-bit operations, and avx512 512-bit operations.

  2. After importing, use the module, for example like mod.merkle_root_simd('avx2', 2, txs). Here you again put one of gen/sse2/avx2/avx512. Why again? When importing, you pass a compilation option that tells the compiler to support the given technology and all below it; here you choose the SIMD technology actually used for the merkle-root call, which can be lower (but not higher) than the compilation technology. For example, if you compiled for avx2, you can use the library with gen, sse2, or avx2, but not avx512.

  3. In step 2 you can see that I used the options ('avx2', 2, txs); here 2 is a parallelization parameter. It is not multi-core but single-core parallelization, meaning that two avx2 registers are computed in a row. Put 1, 2, 4, or 8, whichever gives you the fastest computation.
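The steps above can be sketched as follows; the names mirror the answer's description, and the import is guarded because the compiled sha256_simd module is only available once both files are in place:

```python
# Sketch of the three steps above; names follow the answer's description.
try:
    mod = sha256_simd_import(cap='avx2')   # step 1: import/compile once per run
except NameError:
    mod = None  # sha256_simd.py is not available in this environment

if mod is not None:
    txs = [b'\x00' * 32, b'\x11' * 32]               # raw-byte transaction hashes
    root = mod.merkle_root_simd('avx2', 2, txs)      # steps 2-3: technology + parallelization
```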

To use the library you need two things installed: a compiler (MSVC on Windows, Clang or GCC on Linux) and, one time, the Cython module via python -m pip install cython. Cython is an advanced library for programming C++ code inside Python; here it acts as a thin wrapper between my Python .py and C++ .hpp modules. Also note that my code uses the modern C++20 standard, so you need an up-to-date C++ compiler: download the latest MSVC on Windows and/or the latest Clang on Linux (through the command bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)", which is described here).

In the .py file you may notice that I sometimes provide the extra params has_ossl = True, win_ossl_dir = 'd:/bin/openssl/'; these two params are needed only if you want the OpenSSL version compiled into my library. Windows OpenSSL can be downloaded from here. The OpenSSL version can then be used through mod.merkle_root_ossl(txs), passing just a single parameter with the transactions.

All functions in my .py module take a list of bytes for the transactions, meaning that if you have hex transactions you have to unhexlify them first. Likewise, all functions return a bytes hash, which you have to hexlify if needed. This bytes-only interface, in and out, is purely for performance.
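That bytes-in/bytes-out convention can be sketched as follows; the mod.merkle_root_simd call is the library function described above, left commented out because it requires the compiled module:

```python
from binascii import unhexlify, hexlify

hex_txids = ["aa" * 32, "bb" * 32]               # example 64-character hex txids
txs = [unhexlify(t)[::-1] for t in hex_txids]    # hex -> raw little-endian bytes, once

# root = mod.merkle_root_simd('avx2', 2, txs)    # library call from this answer
# print(hexlify(root[::-1]))                     # raw bytes -> display hex, once
```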

I understand that my code is complex to understand and use. So if you are serious about getting the fastest code, please ask me questions about how to use and understand it. I should also say that my code is quite dirty; I didn't aim to make a clean, polished library for everyone, I just wanted a proof of concept that the SIMD version is considerably faster than the hashlib version and even the OpenSSL version. Of course, that holds only if your CPU is advanced enough to support at least one of SSE2/AVX2/AVX512; most CPUs support SSE2, but not all support AVX2 or AVX512.

sha256_simd.py

sha256_simd.hpp

Arty
  • Interesting. Note that new Intel processors (Sunny Cove & Ice Lake) have dedicated SHA-computing instructions that will likely be used by third-party libraries like OpenSSL in the near future for the sake of performance. – Jérôme Richard May 14 '21 at 18:48
  • @JérômeRichard Nice note! I thought that there are such instructions in some processors but didn't search for them. Yes if one has advanced processor with SHA instructions then it will be not difficult to implement Sha256 class like above in 10-15 lines of code. If OP has such processor and needs support I can extend my library and implement use for these instructions. – Arty May 15 '21 at 06:29
  • @JérômeRichard It seems to me that SHA instructions set is quite rare set, few processors support it. I tested 5 different PCs, none has SHA set, although have quite advanced AVX512. For example [here godbolt](https://godbolt.org/z/q3o5Gsh8z) has AVX512 support, but still has no SHA support, program just crashes. So before implementing these support in my code I want to check if OP has such instructions, if not then it is not worth spending time devoting to this support which will not be used, unless some other people will be interested in it. – Arty May 15 '21 at 07:02
  • Yeah, this is only on recent processors (especially for Intel processors). Intel launched Ice-Lake in mid-2019, but targeted only the notebook/mobile sector. Desktop/Server processors supporting SHA instructions has been released just 6 month ago and are thus barely used in computers yet (this is why GodBolt does not have the support yet since AFAIK they use Intel's server processors so far). However, AMD supports SHA instructions in all Zen processors since 2017 and they have >30% market share now. – Jérôme Richard May 15 '21 at 11:44
  • @JérômeRichard OK, then I'm waiting for OP's response regarding if he/she needs this support in my class/code. Or other people who will be interested in such support. Because right now my code covers all 100% processors, and this SHA instruction set implementation will only cover small part of PCs so don't know if it is worth starting to implement it for me. – Arty May 15 '21 at 12:01
  • @JérômeRichard Also if I decide to implement these SHA instructions then I don't know where I can test the code. Don't want to post untested code, because it may not work. I don't have new CPUs anywhere in my reach. Do you know any online sites that can run my C++ code on SHA-supporting CPUs? Like GodBolt site but with newer CPUs. – Arty May 15 '21 at 12:11
  • Most servers are sadly running on Intel Xeon processors nowadays and very few will be more recent than 6 month. So I guess it would be very difficult to find a free online IDE providing such an instruction yet. That being said, I do not think you need to implement that. The comment about the SHA instruction was just for information and to help future readers ;) . – Jérôme Richard May 15 '21 at 12:42
  • Do you have Cython bindings for just the sha256 on its own? That would be much more versatile for me. Or do you know of some good bindings somewhere (only for sha256) – AustEcon Oct 11 '21 at 00:09
  • @AustEcon Do you need specifically Cython? Or just any kind of binding between C and Python for sha256? Standard [hashlib](https://docs.python.org/3/library/hashlib.html) module is already such binding, because it is written in C and has Python interface. – Arty Oct 11 '21 at 02:37

In the last update (2 May 2021 at 17:00), the calls to sha256(value).digest() take roughly 80% of the time on my machine. There are a few possible solutions to fix that.

The first is to parallelize the computation using multiprocessing, assuming the work of each iteration is independent. Here is an example:

from multiprocessing.pool import Pool

# [...] same as in the question

def iteration(txids):
    merkle_root_hash = calculate_merkle_root(txids)
    merkle_root_hash = hexlify(merkle_root_hash[::-1])
    return merkle_root_hash

processPool = Pool()
res = processPool.map(iteration, [txids for i in range(1000)])

print(res[-1])

This is 4 times faster on my 6-core machine.

Another solution is to find a faster Python module that can compute multiple sha256 hashes at once, reducing the number of expensive C calls from the CPython interpreter. I am not aware of any package that does this.

Finally, one efficient solution is to (at least partially) rewrite the expensive calculate_merkle_root computation in C or C++ and run it in parallel. This should be significantly faster than your current code, as it removes both the function-call overhead and the multiprocessing cost. There are many libraries that compute a sha256 hash (such as the Crypto++ library).

Jérôme Richard
  • thank you for your answer! Regarding the 3rd solution, I tried a C++ implementation, but it seems not to be more efficient: [test.cpp](https://gist.github.com/andy121090/85a745f9886b791d4a980d00fb75ac27).. I am not very fluent in C++, maybe this is not what you meant by "rewriting the expensive computation"? – Andy May 02 '21 at 17:54
  • I will also try the `multiprocessing` approach, but my initial try is to optimise the code as much as possible without parallelization. – Andy May 02 '21 at 17:57
  • @Andy for Python, I think you have reached the limit of the language in sequential code (unless there is some algorithmic optimization I did not see). For the C++ code, this is a good first try, but there are plenty of optimizations you can apply. But first, you should 1. check compiler optimizations are enabled and 2. check the sha256 library you use is fast (as not all are equally fast). I think you could eventually open a new question for the C++ version. – Jérôme Richard May 02 '21 at 19:13
  • 1
    regarding C++ I created a new question: https://stackoverflow.com/questions/67364134/how-to-improve-the-speed-of-merkle-root-calculation-in-c – Andy May 03 '21 at 06:11