See Agner Fog's microarch guide for pipeline details like this, and David Kanter's uarch deep-dive on Haswell with block diagrams: https://www.realworldtech.com/haswell-cpu/ (which also links to some of his articles on other uarches, like SnB, Core 2, AMD Bulldozer, and K8). See also the other links in https://stackoverflow.com/tags/x86/info
Yes, modern x86 cores are superscalar out-of-order execution. The fundamentals haven't changed since PPro: decode x86 machine code into micro-ops (uops) that can be scheduled by a ROB + RS.
(Terminology: Intel uses "issue" to mean "copy into the out-of-order back-end", allocating resources (ROB and RS entries) and updating the RAT, and "dispatch" to mean "send from the scheduler to an execution unit". Much of the rest of the computer-architecture field uses the opposite terminology.)
Intel since Core 2 has been 4 uops wide superscalar in the issue/rename/allocate stage, the narrowest point in the pipeline. (Before that, PPro through Pentium M was 3-wide.) Core 2 could rarely sustain 4/clock in practice; there were too many other bottlenecks. Skylake can often come very close in high-throughput code.
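As a back-of-envelope illustration (a toy model, not a real performance predictor: real loops hit port, latency, and memory bottlenecks too), the issue width alone puts a floor on cycles per loop iteration:

```python
# Toy model of the issue-width bottleneck: lower bound on cycles per
# loop iteration if the ONLY limit were the issue/rename stage.
def min_cycles_per_iter(fused_domain_uops, issue_width=4):
    """Lower bound on cycles/iteration from issue width alone."""
    return fused_domain_uops / issue_width

# A hypothetical 9-uop loop body can't run faster than 2.25 cycles per
# iteration on a 4-wide core, no matter how many execution ports it has.
print(min_cycles_per_iter(9))                 # 2.25
print(min_cycles_per_iter(9, issue_width=5))  # 1.8 on a 5-wide core
```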
To get more work into each fused-domain uop, there's micro-fusion of an ALU uop with its memory-source load, and macro-fusion of e.g. cmp/test + jcc so a compare-and-branch pair decodes as a single uop. (See Agner Fog's microarch guide.) The max sustained unfused-domain throughput is 7 uops per clock, achievable in practice on Skylake. In a burst, the scheduler can dispatch uops to every port.
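To see how 4 fused-domain uops/clock can be 7 unfused-domain uops/clock, here's a toy accounting of a hypothetical loop body (instruction mnemonics and uop counts chosen for illustration; check Agner Fog's tables for real counts):

```python
# Fused- vs unfused-domain uop accounting for a hypothetical loop body.
# Each entry: (instruction, fused-domain uops, unfused-domain uops)
loop_body = [
    ("add eax, [rdi]", 1, 2),  # micro-fused ALU + load
    ("add ebx, [rsi]", 1, 2),  # micro-fused ALU + load
    ("mov [rdx], ecx", 1, 2),  # store: store-address + store-data micro-fused
    ("cmp/jcc",        1, 1),  # macro-fused compare-and-branch: 1 uop in both domains
]

fused   = sum(f for _, f, _ in loop_body)
unfused = sum(u for _, _, u in loop_body)
print(fused, unfused)  # 4 7: issues in one clock, but keeps 7 execution-side uops busy
```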
Ice Lake (Sunny Cove uarch) widens the issue stage to 5.
AMD Zen's front-end is 6 uops wide, but only 5 instructions wide, so it can only achieve 6 uops/clock when running at least some 2-uop instructions, e.g. 256-bit AVX SIMD instructions, which it decodes into 2x 128-bit halves (or worse for lane-crossing shuffles).
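A minimal sketch of that dual limit, assuming a simplified model (ignoring alignment and decode-group effects) where each cycle is bounded by whichever of the two limits binds first:

```python
# Sketch of Zen 1's front-end limits: at most 5 instructions AND
# at most 6 uops per clock. 256-bit AVX ops decode to 2 uops each.
def zen1_uops_per_clock(instructions, insn_limit=5, uop_limit=6):
    """instructions: per-instruction uop counts in program order.
    Returns uops issued in one clock under both limits (toy model)."""
    insns = uops = 0
    for u in instructions:
        if insns + 1 > insn_limit or uops + u > uop_limit:
            break
        insns += 1
        uops += u
    return uops

print(zen1_uops_per_clock([1, 1, 1, 1, 1]))  # 5: all 1-uop insns, the insn limit binds
print(zen1_uops_per_clock([2, 1, 1, 1, 1]))  # 6: one 2-uop AVX insn reaches the uop limit
```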
Skylake widened the legacy decoders to 5 uops/clock and uop-cache fetch to 6 uops/clock, up from 4/clock in SnB through Broadwell. This hides front-end bubbles and keeps the issue/rename stage fed with 4 uops per clock more of the time in high-throughput code. (There are buffers/queues between stages, e.g. the 64-uop IDQ that feeds the issue/rename stage.)
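A toy simulation of why that helps (hypothetical decode pattern and simplified single-queue model, just to show the buffering effect): a queue between the decoders and the issue stage absorbs decode bubbles as long as the decoders have banked up enough surplus.

```python
# Toy model: decoders deliver a variable number of uops per cycle into a
# queue (like the IDQ); the issue stage drains up to 4/clock from it.
def run(decode_pattern, queue_cap=64, issue_width=4):
    """Return uops issued each cycle for a per-cycle decode pattern."""
    queue = 0
    issued = []
    for decoded in decode_pattern:
        queue = min(queue + decoded, queue_cap)
        n = min(queue, issue_width)  # issue stage drains the queue
        queue -= n
        issued.append(n)
    return issued

# 5/clock decode banks up a surplus, so a one-cycle decode bubble (0 uops)
# is fully hidden: issue still sustains 4/clock every cycle.
print(run([5, 5, 5, 5, 0, 5, 5]))  # [4, 4, 4, 4, 4, 4, 4]
```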
This includes your Kaby or Coffee Lake CPU: microarchitecturally the IA cores in KBL are identical to SKL, and Coffee Lake is a very minor tweak (fixing the loop buffer which SKL had to disable in a microcode update because of a partial-register merging uop erratum, aka CPU bug). KBL and CFL have better GPUs than SKL but the x86 cores are basically the same.
Yes, there are diminishing returns beyond 3- or 4-wide for most code, but SMT lets a wide core find the ILP in two (or 4 or 8) threads of execution at once. That keeps wide cores from being wasted, but the cost of a core scales more than linearly with width, so you only do it if a single thread can sometimes use most of that width. Otherwise you'd just build more smaller cores. (At least if you have a scalable interconnect for more cores...) My answer on Why not make one big CPU core? on electronics.SE has more details about the tradeoffs and the limited ILP available in real workloads.