To begin with, the results will most likely differ for a JVM run in client or in server mode. Second of all, this number depends highly on the complexity of your code and I am afraid that you will have to exploratively estimate a number for each test case. In general, the more complex your byte code is, the more optimization can be applied to it and therefore your code must get relatively hotter in order to make the JVM reach deep into its tool box. A JVM might recompile a segment of code a dozen of times.
Furthermore, a "real world" compilation depends on the context in which your byte code is run. A compilation might for example occur when a monomorphic call site is promoted to a megamorphic one such that an observed compilation actually represents a de-optimization. Therefore, be careful when assuming that your micro benachmark reflects the code's actual performance.
Instead of the suggested flag, I suggest you to use CompilationMXBean
which allows you to check the amount of time a JVM still spends with compilation. If that time is too high, rerun your test until the value gets stable long enough. (Be patient!) Frameworks can help you with creating good benchmarks. Personally, I like caliper. However, never trust your benchmark.
From my experience, custom byte code performs best when you immitate the idioms of javac. To mention the one anecdote I can tell on this matter, I once wrote custom byte code for the Java source code equivalent of:
int[] array = {1, 2, 3};
javac creates the array and uses dup
for assigning each value but I stored the array reference in a local variable and loaded it back on the operand stack for assigning each value. The array had a bigger size than that and there was a noteable performance difference.
Finally, I recommend this article before writing a benchmark.