Modern CPUs have coherent caches. A coherent cache is implemented by a cache coherence protocol. When a store wants to commit to the coherent cache, it first needs to wait for the cache line to be invalidated on all CPUs. Any other CPU that would access the cache after the cache line has been invalidated, will not see an older version.
Writing to a cache line is linearizable because the effect of writing to a cache line takes place between the start of the write and completion of the write (its linearization point is between these 2 events). Reading from the cache is linearizable as well since its linearization point is also between start and completion.
Because accessing a cache line is linearizable and linearizability is composable, the cache (the set of all cache lines) is linearizable. Because the cache is linearizable, there is guaranteed to be a total order over all reads and writes from the cache.
This total order might or might not be consistent with the program order; this is not a concern of the cache. E.g. out-of-order execution or committing stores from the store buffer, into the cache, out of order might break program order. But at least nobody will be able to disagree on the order in which loads/stores are done on the cache.
The problems that prevent having a total order on the memory order are caused by the layers above the cache. For example, the store buffer of one CPU might be shared with a strict subset of the CPUs (e.g. PowerPC). This way some CPUs can see stores early and others can't. Another cause is STLF (Store To Load Forwarding, aka store forwarding).
How does TSO/STLF fit into the picture:
Because older stores can be reordered with newer loads to different addresses, due to store buffers, we need to give up the [StoreLoad] ordering within memory order. But this doesn't mean we can't have a total order over all loads/stores; it just means we need to relax the program order restrictions within the memory order.
The tricky part is STLF. Because with STLF there can be disagreement on how a load is ordered with respect to the store it reads from. Example:
int a=0, b=0
CPU1
a=1 (1)
r1=a (2)
r2=b (3)
CPU2:
b=1 (4)
r3=b (5)
r4=a (6)
Condition:
r1=1,r2=0,r3=1,r4=0
The condition is possible with TSO. The load 'r1=a' can only be ordered after the store 'a=1' (a load can't be ordered in the memory order before its store it reads from).
But 'r1=a' is also ordered before 'a=1' due to STLF: it takes its value from the store buffer, before 'a=1' is inserted into the memory order.
So we end up with a cycle in the memory order. To resolve this problem, the load 'r1=a' and the store 'a=1' are not ordered by the memory order. As a consequence, we are just left with a total order over the stores.
The book "A primer on Memory Consistency and Cache Coherence" is a superb book. A lot of my understanding on this topic comes from this book and the linearizability argument and dealing with the STLF cycle from this book (I also had a few message exchanges with one of the authors).