I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean?
I can't find anything online that describes policy iteration and value iteration in terms of direction, and to my knowledge this is not a common way to explain the difference between them.
One possibility is that someone was referring to the visual impression of values propagating in value iteration. After the first sweep, the values are correct on a 1-timestep horizon: each value correctly tells you what to do to maximize your cumulative reward if you have 1 timestep to live. This means that states that transition into the terminal state and receive a reward have positive values, while almost everything else is 0. With each sweep, the values become correct over a horizon that is one timestep longer, so the values creep backwards from the terminal state towards the start state as the horizon expands (the sketch below illustrates this on a small chain of states). In policy iteration, instead of propagating values back one step at a time, you calculate the complete value function for the current policy, then improve the policy and repeat. I can't say that this has a forward connotation to it, but it certainly lacks the backwards appearance. You may want to see Pablo's answer to a similar question for another explanation of the differences that may help you contextualize what you have heard.
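For concreteness, here is a minimal sketch of that "creeping backwards" behaviour. The 5-state chain, the single "move right" action, the reward of 1 for entering the terminal state, and the discount factor are all illustrative assumptions, not something from the question:

```python
# Values "creep backwards" during value iteration sweeps on a small chain MDP.
n_states = 5          # states 0..4; state 4 is terminal
gamma = 0.9           # discount factor (assumed)
V = [0.0] * n_states  # value estimates, initialised to zero

def sweep(V):
    """One synchronous value-iteration sweep: each state backs up from its successor."""
    new_V = V[:]
    for s in range(n_states - 1):        # the terminal state keeps value 0
        # Only action: move right. Reward 1 when the move reaches the terminal state.
        reward = 1.0 if s + 1 == n_states - 1 else 0.0
        new_V[s] = reward + gamma * V[s + 1]
    return new_V

for k in range(1, 5):
    V = sweep(V)
    print(f"after sweep {k}: {[round(v, 2) for v in V]}")
# Sweep 1 gives a nonzero value only next to the terminal state;
# each later sweep pushes nonzero values one state further back.
```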
It's also possible that you heard about this forwards-backwards contrast in regard to something related but different: implementations of temporal difference learning algorithms. In that case, the direction refers to where you look when making an update to state or state-action values: the forward view needs information about the results of future actions, while the backward view only needs information about things that happened previously. You can read about this in chapter 12 of Reinforcement Learning: An Introduction (2nd edition).
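If it helps, here is a minimal sketch of the backward view for the state-value prediction case: TD(λ) with accumulating eligibility traces, as described in that chapter. The environment interface (`env.reset()` / `env.step()` returning `(next_state, reward, done)` under some fixed policy), the use of a `defaultdict` for `V`, and the hyperparameters are all illustrative assumptions:

```python
from collections import defaultdict

def td_lambda_episode(env, V, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of backward-view TD(lambda) prediction; V is a defaultdict(float)."""
    e = defaultdict(float)                    # eligibility traces for visited states
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step()          # environment follows some fixed policy
        # The TD error uses only the current transition -- no look-ahead needed.
        delta = r + gamma * V[s_next] * (not done) - V[s]
        e[s] += 1.0                           # bump the trace of the current state
        for state in e:                       # credit flows backwards to past states
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam           # traces fade as time moves on
        s = s_next
    return V
```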
Why does policy iteration have to do a bunch of value function calculations when value iteration seemingly just does one that ends up being optimal? Why does value iteration converge in just one step?
In policy evaluation, we already have a policy and we're just calculating the value of taking actions as it dictates. The algorithm repeatedly looks at each state and moves that state's value towards the values of the states the policy's action transitions to, until the values stop changing and we consider the evaluation converged. This value function is not optimal; it's only useful because we can combine it with the policy improvement theorem to improve the policy. The expensive step of extracting a new policy, which requires maximizing over actions in each state, happens infrequently, and policies tend to converge after relatively few improvements. So even though the policy evaluation step looks time-consuming, policy iteration is actually pretty fast. A sketch of the whole loop follows.
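Here is a rough sketch of that loop (not from the original answer), assuming a finite MDP where `P[s][a]` is a list of `(probability, next_state, reward)` transitions, with `gamma` and the tolerance chosen for illustration:

```python
def policy_iteration(states, actions, P, gamma=0.9, tol=1e-8):
    policy = {s: actions[0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Expected one-step return of taking action a in s, then following V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: sweep until V stops changing for the *current* policy.
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: the (relatively expensive) maximisation over actions,
        # done only once per evaluation phase.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q(s, a))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                 # policy unchanged, so it is optimal
            return policy, V
```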
Value iteration is just policy iteration where you do exactly one sweep of policy evaluation and extract a new policy at the same time (the maximization over actions is the implicit policy extraction). You then repeat this combined sweep until the values stop changing. Merging the two steps makes the algorithm look simpler on paper, but maximizing over actions at every sweep is expensive, and in practice value iteration is often slower than policy iteration.
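And the corresponding value iteration sketch, with the same assumed MDP interface, where the max over actions inside the sweep is the implicit policy extraction:

```python
def value_iteration(states, actions, P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Maximising over actions on every sweep is what makes each sweep costly.
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Extract the greedy policy once from the converged values.
    policy = {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in states
    }
    return policy, V
```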