I am implementing the Q-learning algorithm and I observed that my Q-values are not converging to the optimal Q-values, even though the policy seems to be converging. I defined the action selection strategy as epsilon-greedy, with epsilon starting at 1 and decreasing by 1/N each iteration (N being the total number of iterations). That way, in the earlier iterations the algorithm mostly takes random actions (exploration), and as the rate gradually decreases it shifts toward exploitation. In addition, I defined the learning rate as 1/N_t(s,a), where N_t(s,a) is the number of times the pair (s,a) has been visited up to time t.
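To make the setup concrete, here is a minimal sketch of what I mean (the tabular `env` interface with `reset()`/`step(a)` is just a placeholder, not any particular library):

```python
import numpy as np

def q_learning(env, n_states, n_actions, N=100_000, gamma=0.99, seed=0):
    """Tabular Q-learning with the schedules described above (sketch only)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))   # N_t(s, a)

    s = env.reset()
    for t in range(N):
        eps = 1.0 - t / N                      # epsilon starts at 1, shrinks by 1/N per step
        if rng.random() < eps:
            a = int(rng.integers(n_actions))   # explore: random action
        else:
            a = int(np.argmax(Q[s]))           # exploit: greedy action

        s_next, r, done = env.step(a)          # hypothetical env interface

        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]             # learning rate 1 / N_t(s, a)
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])

        s = env.reset() if done else s_next
    return Q
```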
Everything seems correct, but since I can't reach the optimal Q-values I started looking into different strategies and in the meantime got quite confused. I know that convergence is guaranteed when all (s,a) pairs are visited infinitely often (together with suitable conditions on the learning rate). Isn't that equivalent to saying all (s,a) pairs are explored many times? In other words, why do we need exploitation at all for convergence? What if we never exploit and just keep exploring? We would then cover the entire state-action space, so shouldn't that be enough to find an optimal policy?
Also, when it's said that the Q-values converge to the optimal ones, does only max_a Q(s,a) converge to its optimal value, or do all Q(s,a) values converge to their optimal values?
There is probably a simple answer to all of this, but even though I have checked a lot of resources and similar threads, I still can't figure out the logic behind exploitation. Thanks a lot in advance for your time!