The whole point of caching is to allow a lot of highly localized memory operations to happen quickly.
The fastest operations involve registers, of course. The only delay in using them is in instruction fetching, decoding, and execution. In some register-rich architectures (and in vector processors), registers are effectively used as a specialized cache. And all but the slowest processors have one or more levels of cache that look like memory to ordinary instructions, only faster.
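The following sketch (a hypothetical example, assuming a reasonably optimizing compiler) shows the commonest way registers serve as that innermost level: the running total lives in a local variable, so the compiler can keep it in a register across the whole loop rather than loading and storing a memory location on every iteration.

    #include <stddef.h>

    /* Hypothetical example: the accumulator stays in a register for the
     * whole loop, so the only memory traffic is the sequential reads of
     * the array itself. */
    static long sum_array(const long *a, size_t n)
    {
        long total = 0;                 /* held in a register, not memory */
        for (size_t i = 0; i < n; i++)
            total += a[i];
        return total;
    }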
To simplify relative to actual processors, consider a hypothetical processor that runs at 2 GHz (0.5 ns per clock), with memory that takes 5 ns to load an arbitrary 64-bit (8-byte) word but only 1 ns to load each successive word. (Assume writes behave the same way.) On such a machine, flipping a bit in memory is pretty slow: 1 ns to load the instruction (and that only if it's not already in the pipeline; after a distant branch it's 5 ns), 5 ns to load the word containing the bit, 0.5 ns to execute the instruction, and 5 ns to write the changed word back to memory. A memory copy fares better: approximately zero to load instructions (since the pipeline presumably does the right thing with loops), 5 ns to load the first 8 bytes, 0.5 ns to execute an instruction, 5 ns to store the first 8 bytes, and 1 + 0.5 + 1 ns for each additional 8 bytes. Locality makes things easier. But some operations are pathological: incrementing each byte of an array pays the initial 5 ns load, the 0.5 ns instruction, and the initial 5 ns store, then 1 + 0.5 + 1 ns per byte (rather than per word) thereafter. (A memory copy whose source and destination don't fall on the same word boundaries is also bad news.)
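To make the pathological case concrete, here is a sketch in C (the function names and the assumption that the buffer length is a multiple of 8 are illustrative, not part of the hypothetical machine) contrasting the per-byte loop with a word-at-a-time variant that performs one load and one store per 8 bytes:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Pathological case from the text: one load and one store per byte. */
    static void increment_bytes(uint8_t *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i]++;
    }

    /* Word-at-a-time variant: one load and one store per 8 bytes.
     * Assumes n is a multiple of 8; memcpy sidesteps alignment and
     * aliasing concerns.  The bit trick adds 1 to each byte of the word
     * without letting a carry spill into the neighbouring byte. */
    static void increment_bytes_wordwise(uint8_t *buf, size_t n)
    {
        const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
        const uint64_t high = 0x8080808080808080ULL;
        for (size_t i = 0; i < n; i += 8) {
            uint64_t w;
            memcpy(&w, buf + i, 8);                    /* one word load   */
            w = ((w & low7) + 0x0101010101010101ULL)   /* +1 in each byte */
                ^ (w & high);                          /* fix the top bits */
            memcpy(buf + i, &w, 8);                    /* one word store  */
        }
    }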
To make this processor faster, we can add a cache that brings loads and stores down to just 0.5 ns over the instruction execution time, for data that's in cache. The memory copy doesn't improve on the read side, since it still costs 5 ns for the first 8-byte word and 1 ns for each additional word, but the writes get much faster: 0.5 ns for every word until the cache fills, and the normal 5 + 1 + 1, etc. rate after it fills, in parallel with other work that uses memory less. The byte increments improve to 5 ns for the initial load, 0.5 + 0.5 ns for the instruction and write, then 0.5 + 0.5 + 0.5 ns for each additional byte, except during cache stalls on reads or writes. More repetition of the same few addresses increases the proportion of cache hits.
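A rough way to see that effect (a hypothetical micro-benchmark sketch, assuming 64-byte cache lines) is to sweep a working set repeatedly: once the buffer fits in the cache, every pass after the first is all cache hits; once it doesn't, every pass pays the memory latencies again.

    #include <stddef.h>
    #include <stdint.h>

    /* Sweep a working set of `size` bytes `passes` times, touching one
     * byte per (assumed 64-byte) cache line.  If `size` fits in the
     * cache, only the first pass misses.  The returned sum must be used
     * by the caller, or the compiler may delete the loop entirely. */
    static uint64_t sweep(const uint8_t *buf, size_t size, int passes)
    {
        uint64_t sum = 0;
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < size; i += 64)
                sum += buf[i];
        return sum;
    }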
What happens with real processors, multiple levels of cache, and so on? The simple answer is that things get more complicated. Writing cache-aware code means improving the locality of memory accesses, analyzing access patterns to avoid thrashing the cache, and a whole lot of profiling.
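As one sketch of what improving locality can look like in practice (the 32-element tile size is an assumption; in real code it would be chosen by profiling against the target's cache sizes), compare a naive matrix transpose with a blocked one:

    #include <stddef.h>

    /* Naive transpose: the stores to dst walk with a stride of n doubles,
     * so for large n every store lands on a different cache line. */
    static void transpose_naive(double *dst, const double *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                dst[j * n + i] = src[i * n + j];
    }

    /* Blocked transpose: work in B x B tiles small enough that the source
     * rows and destination cache lines being touched stay resident while
     * the tile reuses them.  B = 32 is an assumed, tunable value. */
    static void transpose_blocked(double *dst, const double *src, size_t n)
    {
        const size_t B = 32;
        for (size_t ii = 0; ii < n; ii += B)
            for (size_t jj = 0; jj < n; jj += B)
                for (size_t i = ii; i < ii + B && i < n; i++)
                    for (size_t j = jj; j < jj + B && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
    }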