A typical modern CPU can execute a number of unrelated instructions (those that don't depend on the same resources) concurrently.
To do that, it typically ends up with a basic structure vaguely like this:

So, we have an instruction stream coming in on the left. We have three decoders, each of which can decode one instruction each clock cycle (but there may be limitations, so complex instructions all have to pass through one decoder, and the other two decoders can only do simple instructions).
From there, the instructions pass into a reorder buffer, which keeps a "scoreboard" of which resources are used by each instruction, and which resources are affected that instruction (where a "resource" would typically be something like a CPU register or a flag in the flags register).
The circuitry then compares those scoreboards to determine dependencies. For example, if one instruction writes to register 0, and a later one reads from register 0, then those instructions must execute serially. At each clock, it tries to find the N oldest instructions that don't have dependencies for execution.
There are then a number of independent execution units. Each of these is basically a "pure" function--it takes some inputs, carries out a specified transformation on it, and produces an output. This makes it easy to replicate them as needed, and have as many running in parallel as we want/can afford. Those are typically grouped, with one port going to each group. In each clock, we can send one instruction through that port to one of the execution units in that group. Once an instruction arrives at the execution unit, it may take more than one clock to finish execution.
Once those execute, we have a set of retirement units that take the results, and write them back to the registers in execution order. Again we have multiple units so we can retire multiple instructions per clock.
Note: this drawing tries to be semi-realistic about the rough number of decoders, retirement units, and ports that it depicts, but what it shows is a general idea--different CPUs will have more or fewer specific resources. For almost any of them, the number of decoded instructions in the scoreboard units is low though--a realistic number would be more like 50 instructions.
In any case, actual execution of instructions is one of the hardest parts of this to measure or reason about. The number of ports gives us a hard upper limit on the number of instructions that can start executing in any given clock. The number of decoders and retirement units give an upper limit on the number of instructions that can be started/finished per clock. The execution itself...well, there are a lot of execution units, and each one (at least potentially) takes a different number of clocks to execute an instruction.
With the design as shown above, you'd have a hard upper limit of three instructions per clock. That's the most you can decode or retire. With a different design, that could obviously go up or down (e.g., with 4 decoders, 4 ports and 4 retirement units, the upper limit could go up to 4).
Realistically, with that design you wouldn't normally expect to see three instructions execute in most clock cycles. There are enough dependencies between instructions that you'd probably expect closer to 2 as a long term average (and much more likely a little less than 2). Increasing the available resources (more decoders, more retirement units, etc.) will rarely help that a whole lot--you might get to an average of three instructions per clock, but hoping for four is probably unrealistic.