Just simple busy loops will increase the CPU usage -- I am not aware if doing FP calculations will change this significantly or otherwise be able to achieve a more consistent load factor, even though it does exercise the FPU and not just the ALU.
While creating a similar proof-of-concept in C# I used a fixed number of threads and changed the sleep/word durations of each thread. Bear in mind that this process isn't exact and is subject to both CPU and process throttling as well as other factors of modern preemptive operating systems. Adding VMware to the mix may also compound the observed behaviors. In degenerate cases harmonics can form between the code designed to adjust the load and the load reported by the system.
If lower-level constructs were used (generally requiring "kernel mode" access) then a more consistent throttle could be implemented -- partially because of the ability to avoid certain [thread] preemptions :-)
Another alternative that may be looked into, with the appropriate hardware, is setting the CPU clock and then running at 100%. The current Intel Core-i line of chips is very flexible this way (the CPU multiplier can be set discreetly through the entire range), although access from Java may be problematic. This would be done in the host, of course, not VMware.
Happy coding.