I think you are correct. I would also have expected the update to contain the full TD error term, which should be reward + discount * q_hat_next - q_hat.
For reference, this is the implementation:
if done: # (terminal state reached)
    w += alpha*(reward - q_hat) * q_hat_grad
    break
else:
    next_action = policy(env, w, next_state, epsilon)
    q_hat_next = approx(w, next_state, next_action)
    w += alpha*(reward - discount*q_hat_next)*q_hat_grad
    state = next_state
And this is the pseudo-code from Reinforcement Learning: An Introduction (Sutton & Barto, page 171):

As the implementation is TD(0), n is 1. Then the update in the pseudo-code can be simplified:
w <- w + a[G - v(S_t,w)] * dv(S_t,w)
becomes (by substituting G == reward + discount*v(S_t+1,w)):
w <- w + a[reward + discount*v(S_t+1,w) - v(S_t,w)] * dv(S_t,w)
Or with the variable names in the original code example:
w += alpha * (reward + discount * q_hat_next - q_hat) * q_hat_grad
I ended up with the same update formula that you have. Looks like a bug in the non-terminal state update.
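For concreteness, here is a minimal, self-contained sketch of the corrected semi-gradient Sarsa/TD(0) step for a linear approximator. The function name and the numbers below are purely illustrative, not taken from the original code:

import numpy as np

def sarsa_td0_update(w, q_hat, q_hat_next, q_hat_grad, reward,
                     alpha, discount, done):
    # The bootstrap term is dropped at a terminal step; otherwise the
    # target is reward + discount * q_hat_next. The current estimate
    # q_hat is always subtracted to form the full TD error.
    target = reward if done else reward + discount * q_hat_next
    return w + alpha * (target - q_hat) * q_hat_grad

# Tiny numerical check with a linear approximator q_hat = w @ x,
# whose gradient with respect to w is the feature vector x itself.
w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])        # features of (state, action)
x_next = np.array([0.0, 1.0])   # features of (next_state, next_action)
w = sarsa_td0_update(w, q_hat=w @ x, q_hat_next=w @ x_next,
                     q_hat_grad=x, reward=1.0,
                     alpha=0.1, discount=0.9, done=False)
print(w)  # weights nudged toward the TD target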
Only the terminal case (if done is true) should be correct, because then q_hat_next is always 0 by definition, as the episode is over and no more reward can be gained.
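Because of that, the two branches could even be collapsed into a single update by zeroing the bootstrap term at terminal states. A small sketch (names again illustrative):

def td_error(reward, q_hat, q_hat_next, discount, done):
    # Zeroing the bootstrap at terminal states makes the terminal update
    # (reward - q_hat) a special case of the general TD error.
    bootstrap = 0.0 if done else discount * q_hat_next
    return reward + bootstrap - q_hat

# At a terminal step the value passed for q_hat_next is irrelevant:
assert td_error(1.0, 0.3, 5.0, 0.9, done=True) == td_error(1.0, 0.3, 0.0, 0.9, done=True)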