As others have stated, the trick only works when the divisor is a power of two, 2^n (so the mask is 2^n - 1), and even then, if the check is done at run time, you have problems.
On recent architectures (say, anything after the P4 era) the latency of integer division instructions is roughly 26 to 50 cycles in the worst case. A multiply, in comparison, can take 1-3 cycles and can often be executed in parallel much better.
The DIV instruction returns the quotient in EAX and the remainder in EDX, so if you need the quotient anyway, the remainder (i.e. the modulus) comes for free.
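A minimal C sketch of that point (divmod is a made-up name, purely illustrative): compute / and % of the same operands and the compiler will typically fold them into a single DIV.

    /* Hypothetical helper: on x86 a compiler usually emits ONE div
       instruction here -- quotient lands in EAX, remainder in EDX. */
    unsigned divmod(unsigned n, unsigned d, unsigned *rem)
    {
        *rem = n % d;    /* the remainder is "free"...            */
        return n / d;    /* ...because this reuses the same DIV   */
    }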
If the divisor is variable at run time and you still want to use &, you have to do something like the sketch after this list:
a) check whether the divisor is a power of two; if so, take your & code path. That check is a branch, with possible cache misses and so on, adding significant latency potential
b) if it is not a power of two, fall back to a DIV instruction
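In C, that dispatch looks roughly like this (mod_dispatch is a hypothetical name; the (m & (m - 1)) == 0 test is the standard power-of-two check):

    /* Sketch of the two code paths; the if itself is the branch the
       rest of this answer argues against.  Assumes m != 0. */
    unsigned mod_dispatch(unsigned x, unsigned m)
    {
        if ((m & (m - 1)) == 0)     /* a) m is a power of two, 2^n */
            return x & (m - 1);     /*    mask with 2^n - 1        */
        return x % m;               /* b) otherwise: a real DIV    */
    }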
Using a DIV instead of adding a branch into the equation (a branch that can cost hundreds or even thousands of cycles in bad cases with poor cache eviction) makes DIV the obvious choice. On top of that, if you are using & with a signed data type, extra conversion or fix-up code is necessary (there is no signed variant of AND, while there are signed and unsigned divide instructions); see the sketch below. Also, if the DIV is only used to branch on the modulus and the rest of its results aren't used, speculative execution can hide much of the latency, and the penalty is further mitigated by multiple pipelines that can execute instructions in parallel.
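To make the signed point concrete: C's % truncates toward zero, so -1 % 8 is -1 while -1 & 7 is 7. A sketch of the fix-up a plain & would need (mod8_signed is an illustrative name):

    /* Sketch: signed x % 8 via masking needs a correction for x < 0. */
    int mod8_signed(int x)
    {
        int r = x & 7;                         /* right only for x >= 0 */
        return (x < 0 && r != 0) ? r - 8 : r;  /* restore the sign of x */
    }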
You have to remember that in real code, much of your cache is filled with the data you are working on, plus other code and data you are about to use or have just used. You really don't want to be evicting cache lines and waiting for them to be refilled because of branch mispredictions. In most cases with modulo you are not just writing i = 7; d = i % 4; the operation sits in larger code, often inside a subroutine that was itself a (predicted and cached) subroutine call directly before. On top of that, you are probably doing it in a loop, which is itself using branch prediction. Nested branch prediction with loops is handled pretty well by modern microprocessors, but it is just plain stupid to add to the predicting it is already trying to do.
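As a made-up example of that common shape, a run-time divisor inside a hot loop whose own branch is already using the predictor:

    /* cap is only known at run time, so each i % cap is a real DIV,
       and the loop branch is already occupying the predictor.
       Assumes cap > 0. */
    void fill_ring(unsigned *ring, unsigned cap, unsigned n)
    {
        for (unsigned i = 0; i < n; i++)
            ring[i % cap] = i;    /* wrap the index with modulo */
    }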
So to summarize: using DIV makes more sense on modern processors for the general case, and it is not really an "optimization" for a compiler to branch to a 2^n path at run time, given the cache and prediction considerations above. If you really, really need to fine-tune that integer divide, and your whole program depends on it, you will end up hard-coding the divisor to a power of two and writing the bitwise & logic yourself, as in the sketch below.
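That hard-coded case looks something like this (BUCKETS and bucket_of are illustrative names):

    #define BUCKETS 16u                   /* fixed at 2^4 */

    /* With the divisor a compile-time power of two, a single AND
       replaces the DIV entirely. */
    unsigned bucket_of(unsigned hash)
    {
        return hash & (BUCKETS - 1u);     /* same result as hash % BUCKETS */
    }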
Finally, this is a bit of a rant: a dedicated ALU for integer divides could bring the latency down to around 6-8 cycles. It just takes up a relatively large die area, because the data path ends up being about 128 bits wide, and nobody has the real estate for it when integer DIVs work just fine as they are.