We have created a custom searcher. Our document count is around 400,000. Latency normally stays below 100 ms, but under load testing throughput does not exceed 80 QPS and latency rises to 4-5 seconds. We are using a 9-node cluster (c5.2xlarge: 8 vCPUs and 16 GB RAM) with grouped distribution (3 groups of size 2 with replication 3 and searchable copies 3). We tried different distributions but saw no improvement in speed. We also tried different values for the tuning parameters, even on larger compute instances:
<requestthreads>
<search>64/128</search>
<persearch>1</persearch>
<summary>16</summary>
</requestthreads>
What would be a good approach to finding the bottleneck? With a cluster this large, we would expect to reach 500 QPS for 500k records.
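As a rough sanity check on the numbers above, Little's law (average requests in flight = throughput × average latency) gives a sense of how much queuing is happening. This is only a sketch: the function names are made up for illustration, 4.5 s is assumed as the midpoint of the observed 4-5 second latency, and 0.1 s is assumed as the target latency at 500 QPS.

```python
def in_flight_requests(qps: float, latency_s: float) -> float:
    """Little's law: average concurrency = throughput * average latency."""
    return qps * latency_s


def sustainable_qps(concurrency: float, latency_s: float) -> float:
    """Little's law rearranged: throughput = concurrency / average latency."""
    return concurrency / latency_s


# Observed under load: 80 QPS at an assumed 4.5 s average latency.
queued = in_flight_requests(80, 4.5)

# Target: 500 QPS at an assumed 0.1 s average latency.
needed = in_flight_requests(500, 0.1)

print(f"in flight during the load test: about {queued:.0f} requests")
print(f"in flight needed at the target: about {needed:.0f} requests")
```

If the requests in flight during the load test far exceed what the target would need, most of the 4-5 seconds is probably time spent waiting in a queue rather than executing, which would suggest looking for a saturated resource or a serialized step (for example, inside the custom searcher) rather than adding more nodes.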