My recommendation is to provide config and command-line switches for assigning the number of threads per-machine. Use a heuristic based on Runtime.getRuntime().availableProcessors() as indicated by other answers here, in cases where the user/admin hasn't explicitly configured the application differently. I strongly recommend against exclusive heuristic-based thread-to-core guessing, for several reasons:
Most modern hardware is moving toward increasingly ambiguous types of 'hardware threads': SMT models such as Intel's Hyperthreading and AMD's Compute Modules complicate formulas (details below), and querying this info at runtime can be difficult.
Most modern hardware has a turbo feature that scales speed based on active cores and ambient temperatures. As turbo tech improves, the range of speed (ghz) grows. Some recent Intel and AMD chips can range from 2.6ghz (all cores active) to 3.6ghz (single/dual core active), which combined with SMT can mean each thread getting an effective 1.6ghz - 2.0ghz throughput in the former design. There is currently no way to query this info at runtime.
If you do not have a strong guarantee that your application will be the only process running on the target systems, then blindly consuming all cpu resources may not please the user or server admin (depending on if the software is a user app or server app).
There is no robust way to know what's going on within the rest of the machine at run-time, without replacing the entire operating system with your own home-rolled multitasking kernel. Your software can try to make educated guesses by querying processes and peeking at CPU loads and such, but doing so is complicated and usefulness is limited to specific types of applications (of which yours may qualify), and usually benefit from or require elevated or privileged access levels.
Modern virus scanners now-days work by setting a special priority flag provided by modern operating systems, eg. they let the OS tell them when "the system is idle". The OS bases its decision on more than just CPU load: it also considers user input and multimedia flags that may have been set by movie players, etc. This is fine for mostly-idle tasks, but not useful to a cpu intensive task such as yours.
Distributed home computing apps (BOINC, Folding@Home, etc) work by querying running processes and system CPU load periodically -- once every second or half-second perhaps. If load is detected on processes not belonging to the app for multiple queries in a row then the app will suspend computation. Once the load goes low for some number of queries, it resumes. Multiple queries are required because the CPU load readouts are notorious for brief spikes. There are still caveats: 1. Users are still encouraged to manually reconfigure BOINC to fit their machine's specs. 2. if BOINC is run without Admin privileges then it won't be aware of processes started by other users (including some service processes), so it may unfairly compete with those for CPU resources.
Regarding SMT (HyperThreading, Compute Modules):
Most SMTs will report as hardware cores or threads these days, which is usually not good because few applications perform optimally when scaled across every core on an SMT system. To make matters worse, querying whether a core is shared (SMT) or dedicated often fails to yield expected results. In some cases the OS itself simply doesn't know (Windows 7 being unaware of AMD Bulldozer's shared core design, for example). If you can get a reliable SMT count, then the rule of thumb is to count each SMT as half-a-thread for CPU-intensive tasks, and as a full thread for mostly-idle tasks. But in reality, the weight of the SMT depends on what sort of computation its doing, and the target architecture. Intel and AMD's SMT implementations behave almost opposite of each other, for example -- Intel's is strong at running tasks loaded with integer and branching ops in parallel. AMD's is strong at running SIMD and memory ops in parallel.
Regarding Turbo Features:
Most CPUs these days have very effective built-in Turbo support that further lessens the value-gained from scaling across all cores of the system. Worse, the turbo feature is sometimes based as much on real temperature of the system as it is on CPU loads, so the cooling system of the tower itself affects the speed as much as the CPU specs do. On a particular AMD A10 (Bulldozer), for example, I observed it running at 3.7ghz on two threads. It dropped to 3.5ghz when a third thread is started, and to 3.4ghz when a fourth was started. Since it's an integrated GPU as well, it dropped all the way to approx 3.0ghz when four threads plus the GPU were working (the A10 CPU internally gives priority to the GPU in high-load scenarios); but could still muster 3.6ghz with 2 threads and GPU active. Since my application used both CPU and GPU, this was a critical discovery. I was able to improve overall performance by limiting the process to two CPU-bound threads (the other two shared cores were still helpful, they served as GPU servicing threads -- able to wake up and respond quickly to push new data to the GPU, as needed).
... but at the same time, my application at 4x threads may have performed much better on a system with a higher-quality cooling device installed. It's all so very complicated.
Conclusion: There is no good answer, and because the field of CPU SMT/Turbo design keeps evolving, I doubt there will be a good answer anytime soon. Any decent heuristic you formulate today may very well not produce ideal results tomorrow. So my recommendation is: don't waste much time on it. Rough-guess something based on core counts that suits local your purposes well enough, allow it to be overridden by config/switch, and move on.