Is there a limitation in Solr 7.6.0 that it cannot use the system's full data transfer capacity?

Question

In order to find out how well Solr 7.6.0 performs on Azure VMs (specifically the replica recovery scenario) I did a performance test.

The objective was to find out how quickly a failed replica node could recover in a shard.

What I did ?

I ran a load test for 48 hours
Captured the full recovery time of a replica node by manually by restarting it at a point when each shard has begun hosting 250+ GB of data (close to RAM of the VM with 32-core machine)

Experiment's findings

One of the biggest red marks with this experiment was the time it took for replicas to recover:

It took ~2 hours for 350 GB recovery which means 350 / (2*60) = 2.91 GB / minute which is 2.91 * 8 / 60 = 0.4 Gb/s (not that it is small b signifying bit, not Byte).

So it seems that Solr7.6.0 is just able to use 0.4 Gb/s bandwidth for transferring the data from its leader to failed replica.

To confirm whether the respective VM supports more bandwidth I ran

iperf3

on these VMs and found out that it was supporting 11 Gb/s (however as per Azure its should be 16 Gb/s which is not deal breaker here).

So, the issue here is that why Solr is only able to utilize 0.4 Gb/s bandwidth while the VMs are able support ~11 Gb/s ?

Replication happens over HTTP, so it'll depend on the throughput of Jetty. There can be severe overhead when making HTTP requests through jetty (and most other httpds). In that case it might be more suitable to handle replication yourself on a lower level and spin up a new replica that way. I also suggest asking the question on the `solr-user` mailing list to see if anyone have specific experience with Azure and Solr replication. — MatsLindh, Sep 21 '19 at 21:23

Is there a limitation in Solr 7.6.0 that it cannot use the system's full data transfer capacity?

What I did ?

Experiment's findings

0 Answers0