[UPDATE1]
Here is my proposal with thread synchronization (but based on our discussion with @IraBaxter I'm not sure anymore that my way gives any advantages):
When algorithm starts only one thread is needed until you get to first fork. When this one thread gets there it should put all possible outcomes (left, right, middle) to stack and stop itself. Then since there are some elements in stack several threads are activated to start from edges stored in stack. When each of these threads reaches a fork all outcomes are put to the stack and threads stop themselves (not all at once, each does when it needs) and take edges from stack. And so on. Every time when any thread is stopped (regardless because of fork or dead end) it switched to mode waiting for edges in stack (or takes one if stack is not empty). And every time when some edge is added to the stack threads are notified about non-emptiness of the stack.
I used term "edge" here in a meaning of position of fork plus direction where to go from given fork. Stack gives you depth-first property of algorithm.
PS: It's possible to optimize this approach by reducing number of synchronization points. We can collect forking edges in separate worklist for every thread and don't stop this thread until it reaches dead-end. We exclude from this worklist those edges where this thread decided to go on each fork. Then we migrate local worklist to global one when thread reaches dead-end. So synchronization is used for idle threads to start from points from global worklist.