Let's take your example
I1:
add r1, r1, 1000
I2:
sub r2, r2, r1
I3:
add r9, r9, 1
and suppose we have a single pipelined ALU, say, with x > 1 stages. Each instruction in the above sequence takes x cycles from issue to writeback, since they all use the ALU.
First consider how this code would run without reordering. I1 is issued at cycle 1. There is a read-after-write (RAW) dependence of I2 on I1 via r1, so I2 has to wait for I1 to complete execution before it can even issue, which happens at cycle x + 1. Since there is no dependence of I3 on the previous instructions, it can issue the cycle after I2 is issued, at cycle x + 2. I3 completes after cycle 2 x + 1.
Now consider what would happen if the CPU can execute out-of-order. The CPU issues any instruction whose inputs and output are all different from the outputs of all executing instructions (the restriction that the output must differ from outputs of executing instructions can be relaxed via techniques such as register renaming). Here, the CPU begins by issuing I1 on cycle 1. On cycle 2, it attempts to issue I2, but cannot do so because of the RAW dependence, but it can issue I3, which it does. I1 is completed after cycle x, so I2 can be issued at cycle x + 1. I3 completes after cycle x + 1. I2 completes after cycle 2 x.
So, 1 cycle was saved by executing I3 ahead of I2, which serves to hide the latency due to the RAW dependence. If x is large, then this savings is not very much. But this is really because our program is small. Imagine there were another instruction after I3, as in
I1:
add r1, r1, 1000
I2:
sub r2, r2, r1
I3:
add r9, r9, 1
I4:
add r4, r4, 4
You can show that without reordering, the CPU as described would take 2 x + 2 cycles to run this code. With reordering, it would still just take 2 x cycles to run. We now have a savings of 2 cycles since we were able to continuing issuing independent instructions to hide the latency between I1 and I2. Obviously, the savings increases the more independent instructions we have.
Note
These arguments do not require that the CPU implement register renaming or be superscalar (defined as the ability to issue multiple instructions per cycle). We just require that the execution units are pipelined or are duplicated, which can be shown to be organizationally equivalent.