For the life of me I don't get what "num_envs_per_worker" does. If the limiting factor is policy evaluation, why would we need to create multiple environments? Wouldn't we need multiple policies instead?
ELI5 please?
The docs say:
Vectorization within a single process: Though many envs can achieve high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs. You can configure {"num_envs_per_worker": M} to have RLlib create M concurrent environments per worker. RLlib auto-vectorizes Gym environments via VectorEnv.wrap().
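If I understand the docs right, the idea is something like the toy loop below: the policy is evaluated once per step on a batch of observations from all M envs, so the fixed per-call latency is paid once instead of M times. (This is just a sketch I wrote to check my understanding; `ToyEnv` and `policy` are placeholders, not RLlib code.)

```python
def policy(batch_obs):
    # Stand-in for a model forward pass: one call returns an
    # action for every observation in the batch.
    return [0 for _ in batch_obs]

class ToyEnv:
    # Trivial placeholder env: observation just counts from the action.
    def reset(self):
        return 0
    def step(self, action):
        return action + 1, 0.0, False, {}

M = 4  # analogous to num_envs_per_worker
envs = [ToyEnv() for _ in range(M)]
obs = [env.reset() for env in envs]

for _ in range(10):
    # ONE batched policy evaluation covers all M envs...
    actions = policy(obs)
    # ...then each env is stepped with its own action.
    results = [env.step(a) for env, a in zip(envs, actions)]
    obs = [o for o, reward, done, info in results]

print(obs)  # one observation per env, e.g. [1, 1, 1, 1]
```

So the single worker still has a single policy; it just amortizes each evaluation's overhead across M concurrent envs.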