In order to find out how well Solr 7.6.0 performs on Azure VMs (specifically the replica recovery scenario) I did a performance test.
The objective was to find out how quickly a failed replica node could recover in a shard.
What I did ?
- I ran a load test for 48 hours
- Captured the full recovery time of a replica node by manually by restarting it at a point when each shard has begun hosting 250+ GB of data (close to RAM of the VM with 32-core machine)
Experiment's findings
One of the biggest red marks with this experiment was the time it took for replicas to recover:
It took ~2 hours for 350 GB recovery which means 350 / (2*60) = 2.91 GB / minute which is 2.91 * 8 / 60 = 0.4 Gb/s (not that it is small b signifying bit, not Byte).
So it seems that Solr7.6.0 is just able to use 0.4 Gb/s bandwidth for transferring the data from its leader to failed replica.
To confirm whether the respective VM supports more bandwidth I ran
iperf3
on these VMs and found out that it was supporting 11 Gb/s (however as per Azure its should be 16 Gb/s which is not deal breaker here).
So, the issue here is that why Solr is only able to utilize 0.4 Gb/s bandwidth while the VMs are able support ~11 Gb/s ?