Further reading: see the end of this answer for links.
Cache lines are 64B in size, and aligned on 64B boundaries. (Some CPUs may use a different cache-line size, but 64B is a very common choice.)
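For example, finding which line an address belongs to is just bit masking (a minimal sketch; the hard-coded 64 is the common line size mentioned above, not something to assume for every CPU):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Minimal sketch: with 64B lines, the low 6 bits of an address are the
    // byte offset within the line; everything above them identifies the line.
    std::uint64_t addr       = 0x7ffd1234abcdULL;   // arbitrary example address
    std::uint64_t line_start = addr & ~63ULL;       // round down to a 64B boundary
    std::uint64_t offset     = addr &  63ULL;       // position within that line
    std::printf("line start = %#llx, offset = %llu\n",
                (unsigned long long)line_start, (unsigned long long)offset);
}
```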
Caches load data for two reasons:
demand misses: a load or store instruction accessed a byte that wasn't in any of the currently-hot cache lines.
Near-future accesses to the same byte (or int or whatever) will hit in cache (temporal locality). Near-future accesses to nearby bytes in the same cache line will also hit (spatial locality). Looping over a multi-dimensional array in the wrong order, or looping over an array of structs but touching only one member, is really bad because the whole cache line has to be loaded but you only use a small part of it (see the loop-order sketch just after this list).
prefetching: after a couple of sequential accesses, the hardware prefetcher notices the pattern and starts loading cache lines that haven't been accessed yet, so hopefully there won't be a cache miss when the program does access them. For a specific example of HW prefetcher behaviour, Intel's optimization manual describes the HW prefetchers in various specific CPUs (see the x86 tag wiki for links), and points out that they only operate when the memory system isn't already flooded with demand misses.
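As a concrete illustration of the loop-order point (a minimal sketch; the array size and element type are arbitrary assumptions):

```cpp
#include <cstddef>

// C arrays are row-major: arr[i][j] and arr[i][j+1] are adjacent in memory and
// usually share a cache line, while arr[i][j] and arr[i+1][j] are 4096B apart.
constexpr std::size_t N = 1024;
static float arr[N][N];

float sum_good() {                        // inner loop walks contiguous memory:
    float s = 0;                          // each 64B line is loaded once and all
    for (std::size_t i = 0; i < N; ++i)   // 16 floats in it get used
        for (std::size_t j = 0; j < N; ++j)
            s += arr[i][j];
    return s;
}

float sum_bad() {                         // inner loop strides by a whole 4KiB row:
    float s = 0;                          // a different cache line on every access,
    for (std::size_t j = 0; j < N; ++j)   // and lines are evicted before reuse
        for (std::size_t i = 0; i < N; ++i)
            s += arr[i][j];
    return s;
}
```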
There's also software prefetching: software can run an instruction that tells the CPU it will access something soon. Because it's only a hint, the program keeps running even if the data isn't ready yet; it doesn't get stuck waiting for a cache miss. Modern hardware prefetching is very good, and software-prefetching is usually a waste of time. It can be useful for something like a binary search, where you can prefetch both the 1/4 and 3/4 locations before looking at the new middle.
Do not add software prefetching without testing that it actually speeds up your real code. Even if HW prefetching doesn't do a good job, SW prefetching may not help, or might hurt. Out-of-order execution in general does a lot to hide cache-miss latency.
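For the binary-search idea, here's a sketch using GCC/Clang's __builtin_prefetch (the function name and search details are illustrative; whether this helps at all is exactly the kind of thing you have to measure):

```cpp
#include <cstddef>

// Sketch of prefetching the two possible next midpoints of a binary search.
// __builtin_prefetch is only a hint: it never faults and never stalls the
// program by itself.  Measure on real data before keeping anything like this.
std::size_t search_last_le(const int* a, std::size_t n, int key) {
    std::size_t lo = 0, hi = n;                    // assumes n >= 1 and a[0] <= key
    while (hi - lo > 1) {
        std::size_t mid = lo + (hi - lo) / 2;
        __builtin_prefetch(&a[lo + (mid - lo) / 2]);   // the 1/4 position
        __builtin_prefetch(&a[mid + (hi - mid) / 2]);  // the 3/4 position
        if (a[mid] <= key) lo = mid; else hi = mid;
    }
    return lo;                                     // index of last element <= key
}
```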
Usually a cache is "full", and loading a new line requires discarding an old line. Caches usually implement an LRU replacement policy by ordering the tags within a set, so every access to a cache line moves it to the most-recently-used position.
Take an 8-way set-associative cache as an example:
A 64B chunk of memory can be cached by any one of 8 "ways" in the set that it maps to. Some of the address bits are used as an "index" to select a set of tags. (See this question for an example of splitting an address into tag | index | offset-within-cache-line. The OP is confused about how it works, but has a useful ASCII diagram.)
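As a sketch of that split, assuming a 32 KiB, 8-way cache with 64B lines (so 32768 / 64 / 8 = 64 sets; these particular numbers are an assumption for illustration, roughly matching a typical L1d):

```cpp
#include <cstdint>

// Split a physical address into tag | index | offset, assuming 64B lines and
// 64 sets.  The low 6 bits select the byte within the line, the next 6 bits
// select the set, and everything above that is the tag.
constexpr unsigned LINE_BITS  = 6;    // log2(64B line)
constexpr unsigned INDEX_BITS = 6;    // log2(64 sets)
constexpr unsigned SET_COUNT  = 64;

struct AddrParts { std::uint64_t tag, index, offset; };

AddrParts split(std::uint64_t paddr) {
    return {
        paddr >> (LINE_BITS + INDEX_BITS),        // tag: everything above the index
        (paddr >> LINE_BITS) & (SET_COUNT - 1),   // index: selects one of 64 sets
        paddr & ((1u << LINE_BITS) - 1)           // offset: byte within the 64B line
    };
}
// All 8 tags in the selected set are compared against split(paddr).tag in parallel.
```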
Hit/miss determination doesn't depend on order. A fast cache (like an L1 cache) will typically check all 8 tags in parallel to find the one (if any) that matches the high bits of the address.
When we need space for a new line, we need to choose one of the 8 current tags to replace (and put the data in its associated 64B storage array). If any are currently in the Invalid state (not caching anything), then the choice is obvious. In the normal case, all 8 tags are already valid.
But we can store additional data alongside the tags, enough to record an ordering. Every time a cache hit happens, the ordering of the tags within the set is updated to put the line that hit in the MRU position.
When a new line needs to be allocated, evict the LRU tag, and insert the new line at the MRU position.
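Here's a toy software model of that true-LRU bookkeeping for one 8-way set. It's only meant to show the ordering update described above, not how hardware actually implements it:

```cpp
#include <array>
#include <cstdint>
#include <optional>

// One 8-way set, with the tags kept in explicit MRU -> LRU order.
struct Set8 {
    static constexpr int WAYS = 8;
    std::array<std::optional<std::uint64_t>, WAYS> tags{};  // [0] = MRU ... [7] = LRU

    // Returns true on a hit.  A hit moves that tag to the MRU position;
    // a miss fills an Invalid way if there is one, otherwise evicts the LRU way.
    bool access(std::uint64_t tag) {
        int pos = WAYS - 1;                     // default victim: the LRU slot
        bool hit = false;
        for (int i = 0; i < WAYS; ++i) {
            if (!tags[i]) { pos = i; break; }   // Invalid way: obvious choice
            if (*tags[i] == tag) { pos = i; hit = true; break; }
        }
        for (int i = pos; i > 0; --i)           // shift older entries toward LRU
            tags[i] = tags[i - 1];
        tags[0] = tag;                          // the accessed line is now MRU
        return hit;
    }
};
// Set8 s;  s.access(0xA); s.access(0xB); s.access(0xA);
// The last call hits and moves tag 0xA back ahead of 0xB in the ordering.
```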
With the standard LRU policy, looping over an array that's just slightly too large to fit in cache means you will never see any cache hits, because by the time you come back to the same address, it's been evicted. Some CPUs use a sophisticated replacement policy to avoid this: e.g. Intel IvyBridge's large shared L3 cache uses an adaptive replacement policy that decides on the fly when to allocate new lines in the LRU position, so new allocations evict other recently-allocated lines, preserving lines that do have future value. This takes extra logic, so it's only done in large / slow caches, not in the faster L2 and L1 caches.
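A sketch of the access pattern being described (the cache size here is a placeholder, purely for illustration):

```cpp
#include <cstddef>
#include <vector>

// Assume (for illustration only) a strictly-LRU cache of some fixed size.
// Repeatedly streaming over an array slightly larger than that evicts each
// line just before the next pass would reuse it, so every pass misses on
// every line.  An adaptive policy can instead keep part of the array resident
// and get hits on that part every pass.
long long scan_passes(std::size_t bytes, int passes) {
    std::vector<char> a(bytes, 1);
    long long sum = 0;
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < a.size(); i += 64)   // one touch per 64B line
            sum += a[i];
    return sum;
}
// e.g. with a hypothetical 256 KiB LRU cache: scan_passes(288 * 1024, 100) gets
// no reuse at all, while scan_passes(224 * 1024, 100) only misses on pass 1.
```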
(Why all 8 tags are usually valid, even at the start of your program:
By the time execution reaches the start of your program, the kernel has already run a bunch of code on the same CPU that will be running your program. Typical modern caches are physically indexed and tagged (the VIPT L1 speed trick avoids aliasing problems, and is really equivalent to getting the index translation for free rather than truly using virtual indexing), so caches don't have to be flushed on context switches. i.e. they cache physical memory, regardless of changes in virtual-to-physical translation caused by page tables.)
You should read Ulrich Drepper's What Every Programmer Should Know About Memory article. IIRC, he goes through the basics of what triggers a load.
Some of Ulrich's specific advice is a bit dated by now; prefetching threads were useful on Pentium 4 but generally aren't anymore. HW prefetchers are smarter now, and Hyperthreading is good enough to just run two full threads.
Wikipedia's CPU Cache article also explains some details about caches, including eviction policy (how the CPU chooses which line to discard when loading a new line).