While trying to reproduce the Hyperledger Fabric performance that the IBM team reported in their article Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains, I ran into several problems and errors. I have collected all the useful information and want to share it with the HF community. I also have a couple of questions for the Fabric developers about its performance.

Target description

A Hyperledger Fabric v1.1.0 network deployed using Cello on four c5.9xlarge (36 vCPUs) AWS instances:

{
    fabric001: {
      cas: [],
      peers: ["anchor@peer1st.main"],
      orderers: ["orderer1st.orderer"],
      zookeepers: ["zookeeper1st"],
      kafkas: ["kafka1st"]
    },
    fabric002: {
      cas: [],
      peers: ["worker@peer2nd.main"],
      orderers: ["orderer2nd.orderer"],
      zookeepers: ["zookeeper2nd"],
      kafkas: ["kafka2nd"]
    },
    fabric003: {
      cas: [],
      peers: ["worker@peer3rd.main"],
      orderers: ["orderer3rd.orderer"],
      zookeepers: ["zookeeper3rd"],
      kafkas: ["kafka3rd"]
    },
    fabric004: {
      cas: ["ca1st.main"],
      peers: [],
      orderers: ["orderer4th.orderer"],
      zookeepers: ["zookeeper4th"],
      kafkas: ["kafka4th"]
    }
}

TLS is disabled.

Fabric channel configuration (all other parameters are left at their defaults):

BatchTimeout: 1s
BatchSize:
    MaxMessageCount: 500
    AbsoluteMaxBytes: 200 MB
    PreferredMaxBytes: 50 MB
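
As a rough illustration of how these settings interact: the orderer cuts a block when the batch reaches MaxMessageCount, when the byte limits would be exceeded, or when BatchTimeout elapses, whichever comes first. The sketch below estimates the block interval and block size for a steady arrival rate. It ignores the byte limits and assumes evenly spaced arrivals; the function name and the example rates are hypothetical, not measurements.

```python
def estimate_block_cut(rate_tps, max_message_count=500, batch_timeout_s=1.0):
    """Estimate (seconds between blocks, transactions per block) under steady load.

    Simplified model: byte limits are ignored and arrivals are evenly spaced.
    """
    if rate_tps <= 0:
        raise ValueError("rate_tps must be positive")
    time_to_fill = max_message_count / rate_tps
    if time_to_fill <= batch_timeout_s:
        # The batch fills up before the timeout fires.
        return time_to_fill, max_message_count
    # The timeout fires first, so the block holds whatever arrived in that window.
    return batch_timeout_s, int(rate_tps * batch_timeout_s)


if __name__ == "__main__":
    for rate in (100, 850, 3500):
        interval, size = estimate_block_cut(rate)
        print(f"{rate} tps: block every {interval:.2f} s, ~{size} tx per block")
```

Under this model, any load above 500 tps produces full 500-transaction blocks, so the 1s timeout only matters at lower rates.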

I tested both CouchDB and LevelDB as the state database, using the official Fabcar chaincode (Golang implementation). I wrote a simple Node.js app that interacts with the Fabric network through the SDK and exposes an HTTP API for load tests. The app is stateless and can be scaled easily. For load testing I use Yandex.Tank. I ran two kinds of high-load tests: query (requests via peer001 to the Fabric state while the blockchain is empty) and insert (transactions written to the blockchain).

Results

CouchDB as a state database

Based on this, I conclude that the Fabric peer has problems with its CouchDB connection under load.

My questions: Does the Fabric community know about this bug? Are there plans to fix it?

LevelDB as a state database

  • Query results: https://overload.yandex.net/102035. CPU and memory usage of the fabric001 containers: [Figure: fabric001 container instances (leveldb, query)]. There are no errors from the blockchain; I only see latency degradation.
  • Insert results: https://overload.yandex.net/102040. CPU and memory usage of the fabric001 containers: [Figure: fabric001 container instances (leveldb, insert)]. Aggressive latency degradation starts at ~850 rps. There are no errors from the blockchain.

My questions: What causes this latency degradation? Why can't I reach the 3500 rps that IBM reports in their article? What plans does the Fabric community have for improving performance?

Dmitry Pugachev
  • out of curiosity... can you repeat the levelDB experiment with the latest master? :) – yacovm May 15 '18 at 10:03
  • Am I supposed to build the Docker images myself? I can try later, but I need some information from the developers: can I build only the peer image from master and deploy it alongside the other Fabric components at version 1.1.0? – Dmitry Pugachev May 15 '18 at 10:28
  • yeah you can build the images locally by fetching the latest master branch and running "make unit-test" – yacovm May 15 '18 at 13:43
  • The first 2 images seem like they are from instance fabric003, not fabric001 as stated in the description. Is that the case? – adnan.c May 15 '18 at 16:59
  • Also, the first two images seem to be the same image (the shape of the curve and the timestamp are similar). It'd be great if you could confirm/update them – adnan.c May 15 '18 at 17:06
  • @adnan.c yes, thank you. I've updated the first figure and fixed comment for the second one. – Dmitry Pugachev May 15 '18 at 17:11
  • @DmitryPugachev Hi! Not sure if you have repeated the tests again after some months. Curious to see if it has improved – emiliomarin Dec 07 '18 at 10:48

1 Answer

Fabric is a queueing system. Under high load, the waiting time grows sharply as the system approaches its capacity (a general property of queues), and the transaction latency grows with it. However, with golevelDB, we should get at least 2000 tps with low latency.
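
To illustrate how latency grows with load in a queue, here is a minimal sketch using the textbook M/M/1 model, where the mean time in the system is 1 / (service rate − arrival rate). This is only an illustration of the queueing effect, not a model of Fabric itself; the 1000 tps service rate is a hypothetical value, not a measured one.

```python
def mm1_mean_latency(arrival_tps, service_tps):
    """Mean time in system (waiting + service) for an M/M/1 queue, in seconds."""
    if arrival_tps >= service_tps:
        return float("inf")  # unstable queue: the backlog grows without bound
    return 1.0 / (service_tps - arrival_tps)


if __name__ == "__main__":
    service_tps = 1000.0  # hypothetical capacity, not a measured Fabric value
    for load in (500, 800, 900, 950, 990):
        latency_ms = mm1_mean_latency(load, service_tps) * 1000
        print(f"{load} tps -> {latency_ms:.0f} ms mean latency")
```

In this toy model, going from 50% to 99% of capacity raises the mean latency from 2 ms to 100 ms, which is why a latency curve can look flat and then degrade abruptly.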

From the CPU utilization plot, it looks like only 16 of the 36 vCPUs are fully utilized. What value is set for validatorPoolSize in core.yaml on each peer? You can set it to a value less than or equal to the block size and check whether the throughput increases.

Performance will also vary depending on the

  1. workload (fabcar vs fabcoin),
  2. disk (hdd vs ssd, local vs network attached),
  3. load generator (CLI vs SDK),
  4. load generation method (open system vs closed system vs some distribution) and
  5. network bandwidth (at least 1.6 Gbps for 2700 tps).

Also, make sure the load generator itself is not becoming the bottleneck. It would help to break the latency down further (endorsement latency, ordering latency, commit latency) and to collect other resource metrics, such as network and disk utilization, so that the bottleneck can be identified easily.
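
The latency breakdown suggested above could be computed along these lines. This is a minimal sketch that assumes you can record four timestamps per transaction on the client side; the `TxTimestamps` class and its field names are hypothetical and are not part of any Fabric SDK.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TxTimestamps:
    """Wall-clock times in seconds for one transaction (hypothetical fields)."""
    submitted: float  # client sends the proposal
    endorsed: float   # client has collected the endorsements
    ordered: float    # transaction appears in a block cut by the orderer
    committed: float  # peer has validated and committed the block


def phase_latencies(tx):
    """Split the end-to-end latency of one transaction into three phases."""
    return {
        "endorsement": tx.endorsed - tx.submitted,
        "ordering": tx.ordered - tx.endorsed,
        "commit": tx.committed - tx.ordered,
    }


def mean_phase_latencies(txs):
    """Average each phase over a non-empty list of transactions."""
    if not txs:
        raise ValueError("txs must not be empty")
    per_tx = [phase_latencies(tx) for tx in txs]
    return {phase: mean(p[phase] for p in per_tx) for phase in per_tx[0]}
```

Comparing the three averages at increasing load shows which phase starts to degrade first.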

You can refer to our technical paper, Performance Benchmarking and Optimizing Hyperledger Fabric, for which we conducted a comprehensive empirical study. With levelDB, we should get at least 2000 tps with low latency.

  • @senthilnathan Thanks for the answer, I really appreciate it. Could you say a few words about CouchDB as a state database? – Dmitry Pugachev May 15 '18 at 13:27
  • @DmitryPugachev :) As golevelDB is an embedded database, we get more throughput than with CouchDB. With CouchDB as the stateDB, for every get/putState the peer needs to issue a GET/POST REST API call over secure HTTP. As a result, the performance degrades. We got a maximum of 700 tps with CouchDB. For a detailed answer, please refer to sections V.D, VI.C, and VI.D of the above paper. – senthil nathan May 15 '18 at 13:51
  • @senthilnathan this is great! Which Hyperledger Fabric version was used during the tests? Is there an up-to-date version of this document? Thanks. – Greivin López Oct 30 '18 at 14:21