This is a typical case of "In theory, theory and practice are the same; in practice, they are not."
Modern CPUs have very sophisticated logic in them, which means that the ACTUAL number of operations performed is different from what you'd think from just looking at the code or thinking about the problem [unless you have a brain the size of a small planet and know how that particular CPU works]. For example, a processor may speculatively execute instructions on one or the other side of a branch before it has actually resolved the branch - if it guessed the "wrong" side, it will discard the results of those instructions - but of course it still took time to execute them.
Instructions are also executed out of order, which means it's hard to predict exactly when any given instruction will execute. There are some exceptions (serializing instructions and memory fences force a degree of ordering), but in general you can't tell just from reading the instruction stream.
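You can watch the speculation cost from ordinary code. Here's a minimal sketch (plain C, everything in it is illustrative, not from the question; it also assumes the compiler keeps the data-dependent branch as a real branch rather than turning it into a conditional move, so use a moderate optimization level). Both passes execute the same instructions, but on random data the branch mispredicts roughly half the time, so the CPU keeps running down the wrong side and throwing the work away:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)     /* 1M bytes of input */
#define REPS 100        /* repeat so the timings are measurable */

static int cmp(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

int main(void) {
    unsigned char *data = malloc(N);
    for (int i = 0; i < N; i++)
        data[i] = rand() & 0xFF;

    /* Pass 0: random data - the branch below is taken ~50% of the
     * time at random, so the predictor is wrong about half the time
     * and the CPU discards the speculated work.
     * Pass 1: the same data sorted - the branch becomes almost
     * perfectly predictable. */
    for (int pass = 0; pass < 2; pass++) {
        if (pass == 1)
            qsort(data, N, 1, cmp);

        clock_t start = clock();
        long long sum = 0;
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                if (data[i] >= 128)      /* data-dependent branch */
                    sum += data[i];
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        printf("%s data: sum=%lld, %.3f s\n",
               pass ? "sorted" : "random", sum, secs);
    }
    free(data);
    return 0;
}
```

The work done is identical either way; only the predictability of the branch changes, and that alone is worth a noticeable chunk of runtime.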
You will only get (anywhere near) the theoretical throughput if you are pushing data and instructions through all the available execution units at once - this means having the right mix of instructions, and of course having ALL of the code and data in caches.
So, in theory, we could stuff the processor full of instructions that max it out by writing very clever code. In practice, that turns very, very quickly into a hard task.
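To make the "right mix of instructions" point concrete, here's a minimal sketch of the one trick that matters most: independent dependency chains (plain C, the names are mine; compile with optimization but without -ffast-math, since that flag would let the compiler reassociate the single chain itself):

```c
#include <stdio.h>
#include <time.h>

#define N 4096          /* small enough to stay in L1 cache */
#define REPS 100000

static double a[N];

int main(void) {
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* One accumulator: every add depends on the previous result, so
     * the loop runs at the LATENCY of one add per element, no matter
     * how many floating-point units the CPU has. */
    clock_t t0 = clock();
    double s = 0.0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            s += a[i];
    double one_chain = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Four independent accumulators: four separate dependency chains,
     * so the out-of-order core can keep several units busy at once. */
    t0 = clock();
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
    double four_chains = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("1 chain:  %.3f s (sum %g)\n", one_chain, s);
    printf("4 chains: %.3f s (sum %g)\n", four_chains, s0 + s1 + s2 + s3);
    return 0;
}
```

Same adds, same data, same cache behaviour - only the dependency structure differs, and the second loop is typically several times faster. Now imagine balancing that against loads, stores, integer work and branches across a whole program, and you see why "clever code" gets hard fast.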
However, the question is about measuring the throughput of instructions, and on modern CPUs this is very much possible with the right extra software. On Linux there are perf and oprofile; for Windows there are Intel's VTune and AMD's CodeAnalyst. These allow you (subject to sufficient privileges) to read the "performance counters" in the processor, which include counters for "number of instructions", "number of float operations", "number of cache misses", "branches mispredicted" and many, many other measurements of processor performance. So given a sufficiently long run (at least a few seconds, preferably more), you can measure the actual instruction count or clock cycles that the processor uses.
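If you want those counters from inside your own program on Linux, the perf_event_open(2) syscall is the mechanism perf itself sits on top of. A minimal sketch counting retired instructions (Linux-specific; the loop being measured is just a stand-in for your real code):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* glibc provides no wrapper for this syscall, so we make our own. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;                   /* user-space only */
    attr.exclude_hv = 1;

    /* Count for the calling thread on any CPU; this is where the
     * "sufficient privileges" caveat bites (see perf_event_paranoid). */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you actually want to measure ... */
    volatile long sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    if (read(fd, &count, sizeof count) == sizeof count)
        printf("instructions retired: %lld\n", count);
    close(fd);
    return 0;
}
```

The no-code route is even simpler: `perf stat ./yourprog` prints instructions, cycles, branches and branch misses for the whole run, and `-e` lets you pick other events such as cache misses.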