There are far too many CPUs out there to make a general assumption about this, but:
If you're, let's say, on a common x86 architecture, then what the cache holds is always whole cache lines, each containing the first address you accessed that led to a cache miss; that is the same for forward access.
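As a rough illustration (assuming 64-byte cache lines and 4-byte `int`s, which is typical for current x86 parts but by no means guaranteed):

```c
#include <stddef.h>

/* Walking an array forward: the miss on data[0] pulls in one whole
 * cache line (assumed here to be 64 bytes = 16 ints), so data[1]..data[15]
 * are then hits; the next miss happens at data[16], and so on. */
long sum_forward(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}
```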
Depending on how sophisticated memory access prediction is, the backward access might also be prefetched; where that prediction happens depends on your CPU architecture, your actual CPU implementation, and your compiler. It's not uncommon for compilers to "know" which memory access patterns work well for a given CPU generation and to make sure memory accesses happen in that order.
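If you want to hand the hardware an explicit hint yourself, GCC and Clang expose `__builtin_prefetch`; whether it helps (or does anything at all) is entirely CPU-specific, so treat this as a sketch rather than a recommendation, and the prefetch distance below is a guess, not a tuned value:

```c
#include <stddef.h>

void scale(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Hint that we'll soon read src one assumed cache line (16 floats
         * = 64 bytes) ahead. Arguments: address, 0 = read, 1 = low
         * temporal locality. */
        if (i + 16 < n)
            __builtin_prefetch(&src[i + 16], 0, 1);
        dst[i] = src[i] * 2.0f;
    }
}
```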
For your very arithmetic case, there might even be automatic detection of, say, four consecutive, aligned addresses being accessed, and automatic vectorization using the SIMD instructions your CPU supports. That also affects the alignment with which the RAM is accessed, which might have an even further influence on cache behaviour.
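A loop of the following shape is a prime candidate for that kind of auto-vectorization (just a sketch; whether the compiler actually emits SSE/AVX here depends on the target, the flags, and what it can prove about aliasing and alignment):

```c
#include <stddef.h>

/* With something like `gcc -O3 -march=native`, loops of this shape are
 * typically turned into SIMD code processing 4 (SSE) or 8 (AVX) floats
 * per iteration. `restrict` promises the arrays don't overlap, which
 * makes the vectorization much easier for the compiler to justify. */
void add_arrays(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```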
Furthermore, since you seem to care about speed, you'd typically allow your compiler to optimize. In very many cases, this leads to such loops being "reversed", or even SIMD'ed.
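For instance, a backwards loop whose iterations are independent can legally be re-ordered by the optimizer, so what you wrote isn't necessarily what runs (again a sketch; the exact transformation depends on compiler and target):

```c
#include <stddef.h>

/* Written backwards, but since no iteration depends on another, an
 * optimizing compiler is free to traverse the data forwards and/or
 * vectorize it; at -O3 the generated code often ends up looking the
 * same as for the forward version. */
void halve_backwards(float *data, size_t n)
{
    for (size_t i = n; i-- > 0; )
        data[i] *= 0.5f;
}
```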
Note that other architectures might work differently: for example, there's an infamous family of Motorola DSPs from the mid-90s that had a relatively simple address generation unit, and things like accessing memory backwards could be fast if you (or your C compiler) knew how to tell it to work backwards; there was also the option to "fuse" a memory load or store with any other CPU instruction, so there your whole caching behaviour would effectively be dominated by how you manually specified the memory access patterns.