There is no way to really avoid them, that's why RAW hazards are called true dependencies. Instructions have to wait for their inputs to be ready before they can execute. (With OoO exec, normally CPUs will dispatch the oldest-ready-first instructions / uops to execution units, for example on Intel CPUs.)
True dependencies aren't something you "solve" in the sense of making them go away, they're the essence of computation, the way multiple computations on the same numbers are glued together to form an algorithm. Other hazards (WAR and WAW) are just implementation details, reusing the same architectural register for something different.
Sometimes you can structure an algorithm to have shorter dependency chains, once things are already nailed down into machine code, the CPU pretty much just has to respect them, with at best out-of-order exec to overlap independent dep chains.
For loads, in theory there's value-prediction, but AFAIK no real CPU is doing that. Mispredictions are expensive, just like for branches. That's why you'd only want to consider that for high-latency stuff like loads that miss in cache, otherwise the gains won't outweigh the costs. (Even then, it's not done because the gains don't outweigh the costs even for loads, including the power / area cost of building a predictor.) As Paul Clayton points out, branch prediction is a form of value prediction (predicting the condition the load was based on). The more instructions you can keep in flight at once with OoO exec, the more you stand to lose from mispredicts, but real CPUs do predict / speculate for memory disambiguation (whether a load reloads an earlier store to an unknown address or not), and (on CPUs like x86 with strongly-ordered memory models) speculating that early loads will turn out to be allowed; as well as the well known case of control dependencies (branch prediction + speculative execution).
The only thing that helps directly with dependency chains is keeping instruction latencies low. e.g. in the case of your example, #3 is just a register copy, which modern x86 CPUs can do with zero latency (mov-elimination), so another instruction dependent on R9 wouldn't have to wait an extra cycle beyond #2 producing a result, because it's handled during register renaming instead of by an execution unit reading an input and producing an output the normal way.
Obviously bypass forwarding from the outputs of execution units to the inputs of the same or others is essential to keep latency low, same as in an in-order classic RISC pipeline.
Or more conventionally, by improving execution units, like AMD Bulldozer family had 2-cycle latency for most SIMD integer instructions, but that improved to 1 cycle for AMD's next design, Zen. (Scalar integer stuff like add
was always 1 cycle on any sane high-performance CPU.)
OoO exec with a large enough window size lets you overlap multiple dep chains in parallel (as in this experiment, and of course software should aim to have enough instruction-level parallelism (ILP) for the CPU to find and exploit. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an example of doing that by summing into multiple accumulators for a dot-product.)
This is also useful for in-order CPUs if done statically by a compiler, where techniques like "software pipelining" are a big deal to overlap execution of multiple loop iterations because HW isn't finding that parallelism for you. Or for OoO exec CPUs with a limited window size, for loops with long but not loop-carried dependency chains within each iteration.
As long as you're bottlenecked on something other than latency / dependency chains, true dependencies aren't the problem. e.g. a front-end bottleneck, ideally maxed out at the pipeline width, and/or all relevant back-end execution units busy every cycle, mean that you couldn't get more work through the pipeline even if it was independent.
Of course, in a lot of code there are enough dependencies, including through memory, to not reach that ideal situation.
Simultaneous Multithreading (SMT) can help to keep the back-end fed with work to do, increasing throughput by having the front-end of one physical core read multiple instruction streams, acting as multiple logical cores. This effectively creates ILP out of thread-level parallelism, which is useful if software can scale efficiently to more threads, exposing enough TLP to keep all the logical cores busy.
Related: