They don't actually have to propagate, the cache coherency model (MESI or something more advanced) provides you a guarantee that the memory behaves coherently, almost as if it's flat and no cached copies exist. Sequential consistency adds to that a guarantee of the same observation order by all agents in the system (notice - most CPUs don't provide sequential consistency through HW alone).
If a thread does a memory write (not even atomic), the core it runs on will fetch the line and obtain ownership over it. Once the write is done, any thread that attempts to observe the line is guaranteed to see the updated value, even if the line still resides in the modifying core - usually this is achieved through snooping the core and getting the line from it as a response. The cache coherency protocols will guarantee that if such a modification exists locally in some core - any other core looking for that line is bound to see it eventually. To do that, the CPU my use snoop filters, directory management (often for cross socket coherency), or other methods.
Now, you're asking why is atomic important? For 2 reasons. First - all the above applies only if the variable resides in memory, and not a register. This is a compiler decision, so the correct type tells it to do so. Other paradigms (like open-MP or POSIX threads) have other ways to tell the compiler that a variable needs to be shared through memory.
Second - modern cores execute operations out-of-order, and we don't want any other operation to pass that write and expose stale data. std::atomic tells the compiler to enforce the strongest memory ordering (through the use of explicit fencing or locking - check out the generated assembly code), which means that all your memory operations from all threads will be have the same global ordering. If you didn't do that, strange things can happen like core A and core B disagreeing on the order of 2 writes to the same location (meaning that they may see different final values in it).
Last, of course, is the actual atomicity - if your data type is not one that has atomicity guaranteed, or it's not properly aligned - this will also solve that problem for you (otherwise the coherency problem intensifies - think of some thread trying to change a value split between 2 cache lines, and different cores seeing partial values)