Ah, a very deep topic indeed!
Cache coherency between cores is used to synthesise (as closely as possible) a Symmetric Multiprocessing (SMP) environment. This harks back to the days, circa the mid 1990s, when multiple single-core CPUs were simply attached to the same single memory bus and caches were far simpler than today's multi-level hierarchies. With multiple CPUs, each with multiple cores, multiple caches and multiple memory interfaces, the synthesis of an SMP-like environment is a lot more complicated, and cache coherency is a big part of that.
So, when one asks, "Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away?", one is really asking "Do we still need an SMP environment?".
The answer is software. An awful lot of software, including all major OSes, has been written on the assumption that it's running in an SMP environment. Take away the SMP, and we'd have to rewrite literally everything.
There are now various sage commentators beginning to wonder in articles whether SMP is in fact a dead end, and whether we should start worrying about how to get out of it. I think that it won't happen for a good long while yet; the CPU manufacturers have likely got a few more tricks to play to get ever-increasing performance, and whilst that keeps being delivered no one will want to suffer the pain of software incompatibility. Security is another reason to move away from SMP - Meltdown and Spectre exploit weaknesses in the way SMP has been synthesised - but I'd guess that whilst other mitigations (however distasteful) are available, security alone will not be sufficient reason to ditch SMP.
"Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?" Why not, indeed? We have been there before. Transputers (1980s, early 1990s) implemented Communicating Sequential Processes (CSP), where if the application needed a different CPU to process some data, the application would have to purposefully transfer data to that CPU. The transfers are (in CSP speak) through "Channels", which are more like network sockets or IPC pipes and not at all like shared memory spaces.
CSP is having something of a resurgence - as a multiprocessing paradigm it has some very beneficial features - and languages such as Go and Rust provide channels in that style (Erlang's message passing is closer to the Actor Model). The thing about those languages' implementations of CSP is that they're having to synthesise CSP on top of an SMP environment, which in turn is synthesised on top of an electronic architecture much more reminiscent of Transputers!
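To make the channel idea concrete, here is a minimal sketch in Rust using only the standard library. The function name `sum_on_worker` is made up for illustration. A `sync_channel` with capacity 0 is the closest thing in `std` to a CSP rendezvous: the sender blocks until the receiver has taken the value. The worker never touches the caller's data structures; everything it needs arrives over a channel, and its answer goes back over another one.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical example: hand numbers to a worker thread one at a time
/// over a channel and get the total back over a second channel.
fn sum_on_worker(values: Vec<u64>) -> u64 {
    // Capacity 0 = rendezvous: `send` blocks until the receiver takes the value.
    let (data_tx, data_rx) = sync_channel::<u64>(0);
    let (result_tx, result_rx) = sync_channel::<u64>(0);

    let worker = thread::spawn(move || {
        let mut total = 0;
        // The loop ends once the sending end has been dropped.
        for v in data_rx {
            total += v;
        }
        result_tx.send(total).expect("caller is waiting for the result");
    });

    for v in values {
        data_tx.send(v).expect("worker is alive");
    }
    // Closing the channel tells the worker there is no more input.
    drop(data_tx);

    let total = result_rx.recv().expect("worker sends exactly one result");
    worker.join().expect("worker did not panic");
    total
}

fn main() {
    println!("{}", sum_on_worker(vec![1, 2, 3, 4]));
}
```

Note that there are no locks or semaphores anywhere in the application code; the only synchronisation is the act of communicating.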
Having had a lot of experience with CSP, my view is that every multi-process piece of software should use CSP; it's a lot more reliable. The "performance hit" of "copying" data (which is what you have to do to do CSP properly on top of SMP) isn't so bad; it's about the same amount of traffic over the cache-coherency connections to copy data from one CPU to another as it is to access the data in an SMP-like way.
Rust is very interesting, because with its syntax strongly expressing data ownership I suspect that it doesn't have to copy data to implement CSP; it can transfer ownership between threads (processes). Thus it may be getting the benefits of CSP without having to copy the data. Therefore it could be very efficient CSP, even if every thread is running on a single CPU core. I've not yet explored Rust deeply enough to know that that is what it's doing, but I have hopes.
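A small experiment along these lines, as a sketch rather than a statement about every channel implementation: send a heap-allocated `Vec<u8>` through a standard-library channel and compare the address of its heap buffer before and after. The function name `buffer_moved_without_copy` is invented for this example. Moving a `Vec` only moves its small pointer/length/capacity header, so the receiving thread should see the same heap address.

```rust
use std::sync::mpsc::channel;
use std::thread;

/// Hypothetical check: send a heap buffer to another thread and report
/// whether the receiver sees the same heap address and length.
fn buffer_moved_without_copy(len: usize) -> bool {
    let buffer: Vec<u8> = vec![0xAB; len];
    let addr_before = buffer.as_ptr() as usize;

    let (tx, rx) = channel::<Vec<u8>>();
    let receiver = thread::spawn(move || {
        let received = rx.recv().expect("sender sends one buffer");
        (received.as_ptr() as usize, received.len())
    });

    // Ownership of `buffer` moves into the channel here.
    tx.send(buffer).expect("receiver is alive");
    // Using `buffer` after this point is a compile error, e.g.:
    // println!("{}", buffer.len());

    let (addr_after, len_after) = receiver.join().expect("receiver did not panic");
    addr_before == addr_after && len_after == len
}

fn main() {
    println!("moved without copy: {}", buffer_moved_without_copy(1 << 20));
}
```

The compiler also rejects any use of `buffer` by the sender after the `send`, which is the CSP guarantee (only one process can touch the data at a time) enforced at compile time.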
One of the nice things about CSP is that with Channels being like network sockets or IPC pipes, one can readily implement CSP across actual network sockets. Raw sockets are not in themselves ideal - they're asynchronous and so more akin to Actor Model (as is ZeroMQ). Actor Model is fairly OK - and I've used it - but it's not as guaranteed to be devoid of runtime problems as CSP is. So one has to implement the CSP bit oneself or find a library. However, with that in place CSP becomes a software architecture that can more easily span arbitrary networks of computers without having to change the software architecture; a local channel and a network channel are "the same", except the network one is a bit slower.
It's a lot harder to take a multithreaded piece of software that assumes SMP, uses semaphores, etc. and scale it up across multiple machines on a network. In practice it can't be done, and the software has to be rewritten.
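As a rough sketch of the "same" part, the application code below is written against two small traits and does not know whether it is talking to an in-process channel or a byte stream. The names `SendEnd`, `RecvEnd` and `Framed` are hypothetical, invented for this example. `Framed` wraps anything implementing `Read` or `Write` (which includes `std::net::TcpStream`); an in-memory buffer stands in for the socket here so that the example runs anywhere. It only shows the shared interface: a real CSP network channel would also need an acknowledgement so that the sender blocks until the receiver has taken the message.

```rust
use std::io::{self, Cursor, Read, Write};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

/// Hypothetical interface for the sending end of a channel.
trait SendEnd {
    fn send_msg(&mut self, msg: &[u8]) -> io::Result<()>;
}

/// Hypothetical interface for the receiving end of a channel.
trait RecvEnd {
    fn recv_msg(&mut self) -> io::Result<Vec<u8>>;
}

// Local channel: an in-process standard-library channel.
impl SendEnd for SyncSender<Vec<u8>> {
    fn send_msg(&mut self, msg: &[u8]) -> io::Result<()> {
        self.send(msg.to_vec())
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "receiver dropped"))
    }
}

impl RecvEnd for Receiver<Vec<u8>> {
    fn recv_msg(&mut self) -> io::Result<Vec<u8>> {
        self.recv()
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "sender dropped"))
    }
}

/// Byte-stream channel: length-prefixed frames over any reader or writer.
struct Framed<S>(S);

impl<W: Write> SendEnd for Framed<W> {
    fn send_msg(&mut self, msg: &[u8]) -> io::Result<()> {
        if msg.len() > u32::MAX as usize {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "message too long"));
        }
        // 4-byte big-endian length, then the payload.
        self.0.write_all(&(msg.len() as u32).to_be_bytes())?;
        self.0.write_all(msg)
    }
}

impl<R: Read> RecvEnd for Framed<R> {
    fn recv_msg(&mut self) -> io::Result<Vec<u8>> {
        let mut len_bytes = [0u8; 4];
        self.0.read_exact(&mut len_bytes)?;
        let mut msg = vec![0u8; u32::from_be_bytes(len_bytes) as usize];
        self.0.read_exact(&mut msg)?;
        Ok(msg)
    }
}

// Application code: identical for both kinds of channel.
fn produce(out: &mut impl SendEnd) -> io::Result<()> {
    out.send_msg(b"hello")?;
    out.send_msg(b"world")
}

fn consume(inp: &mut impl RecvEnd, count: usize) -> io::Result<Vec<String>> {
    let mut words = Vec::new();
    for _ in 0..count {
        let msg = inp.recv_msg()?;
        words.push(String::from_utf8_lossy(&msg).into_owned());
    }
    Ok(words)
}

fn run_local() -> io::Result<Vec<String>> {
    let (mut tx, mut rx) = sync_channel::<Vec<u8>>(0);
    let producer = thread::spawn(move || produce(&mut tx));
    let words = consume(&mut rx, 2)?;
    producer.join().expect("producer did not panic")?;
    Ok(words)
}

fn run_framed() -> io::Result<Vec<String>> {
    // An in-memory buffer stands in for a socket in this sketch.
    let mut writer = Framed(Vec::new());
    produce(&mut writer)?;
    let mut reader = Framed(Cursor::new(writer.0));
    consume(&mut reader, 2)
}

fn main() -> io::Result<()> {
    println!("{:?}", run_local()?);
    println!("{:?}", run_framed()?);
    Ok(())
}
```

`produce` and `consume` are untouched when the transport changes, which is the property that makes it practical to spread a CSP design across machines.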
More recently than Transputers, the Cell processor (of PlayStation 3 fame) was a multi-core device that did exactly as you suggest. It had a single CPU core, and 8 SPE maths cores each with 256 KB of on-chip core-speed static RAM. To use the SPEs you had to write software to ship code and data in and out of that 256 KB (there was a monster-fast internal ring bus for doing this, and a very fast external memory interface). The result was that, with the right developer, very good results could be attained.
It took Intel about a further 10 years to usefully get x64 up to about the same performance; adding a Fused Multiply-Add instruction to the x86 SIMD extensions (the successors to SSE) was what finally got them there, an instruction they'd been keeping in Itanium's repertoire in the vain hope of boosting its appeal. Cell (the SPEs were based on the PowerPC equivalent of SSE - Altivec) had had an FMA instruction from the get-go.