Threading Building Blocks (TBB) is a templated C++ library for task parallelism. The library contains various algorithms and data structures specialized for task parallelism. I have had success using parallel_for as well as parallel_pipeline to greatly speed up computations. With a little extra coding, TBB's parallel_for can take a serial for loop whose iterations are suitable for parallel execution and run it as such (see example here). TBB's parallel_pipeline can execute a chain of dependent stages, with the option of running each stage in parallel or serially (see example here). There are many more examples on the web, especially at software.intel.com, and here on stackoverflow (see here). A rough sketch is shown below.
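The following is a minimal sketch (not taken from the linked examples) of how a serial loop might be handed to parallel_for; the array sizes and the element-wise addition are just placeholders:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

int main() {
    std::vector<float> a(1000000, 1.0f), b(1000000, 2.0f), c(1000000);

    // The serial loop "for (i = 0; i < c.size(); ++i) c[i] = a[i] + b[i];"
    // becomes a lambda over a blocked_range; TBB splits the range across
    // worker threads because the iterations are independent.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, c.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                c[i] = a[i] + b[i];
        });
    return 0;
}
```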
OpenMP is an API for thread parallelism that is accessed primarily through compiler directives. Although I prefer the richer feature set provided by TBB, OpenMP can be a quick way of testing out parallel algorithms and code (just add a pragma and set some build settings). Once things have been tested and experimented with, I have found that converting certain uses of OpenMP to TBB can be done fairly easily. This isn't to say that OpenMP is not meant for serious coding; in fact, there are cases where one might prefer OpenMP over TBB (for one, because it relies primarily on pragmas, switching back to serial execution can be easier than with TBB). A number of open source projects that utilize OpenMP can be found in this discussion. There are a number of examples (e.g., on wikipedia) and tutorials on the web for OpenMP, including many questions here on stackoverflow.
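As a small sketch of the "just add a pragma" point (the loop itself is only a placeholder), the same element-wise addition with OpenMP might look like this; compile with -fopenmp (GCC/Clang) or /openmp (MSVC):

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // The only parallelism-specific code is the pragma below. If OpenMP is
    // not enabled at build time, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);
    return 0;
}
```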
I previously neglected to discuss SIMD (single instruction, multiple data), which provides data parallelism. As pointed out in the comments below, OpenMP is one option for exploring SIMD (check this link). Extensions to instruction sets such as SSE and AVX (both extensions to the x86 instruction set architecture) as well as NEON (ARM architecture) are also worth exploring. I have had both good and bad experiences using SSE and AVX. The good is that they can provide a nice speedup to certain algorithms (in particular, I have used Intel intrinsics). The bad is that the ability to use these instructions depends on specific CPU support, which may cause unexpected failures (e.g., illegal-instruction faults) at runtime on unsupported processors.
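To make the CPU-support caveat concrete, here is a hedged sketch using AVX intrinsics with a runtime check; __builtin_cpu_supports and the target attribute are GCC/Clang-specific, and the fixed array size of 8 is just for illustration:

```cpp
#include <immintrin.h>  // Intel intrinsics (AVX)
#include <cstdio>

// Compiled for AVX regardless of the default target (GCC/Clang attribute).
// Processes 8 floats per iteration; assumes n is a multiple of 8.
__attribute__((target("avx")))
void add_avx(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    // Without this guard, running add_avx on a CPU lacking AVX would fail
    // with an illegal-instruction fault at runtime.
    if (__builtin_cpu_supports("avx")) {
        add_avx(a, b, c, 8);
        std::printf("c[0] = %f\n", c[0]);
    } else {
        std::printf("AVX not supported on this CPU\n");
    }
    return 0;
}
```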
Specifically with respect to parallelism and mathematics, I have had good experiences using Intel MKL (which now has a no-cost option) as well as OpenBLAS. These libraries provide optimized, parallel, and/or vectorized implementations of common mathematical routines (e.g., BLAS and LAPACK). There are many more mathematics-focused libraries that involve optimized parallelism to some extent. While they may not provide lower-level building blocks of parallelism (e.g., the ability to manipulate threads or schedule tasks), it is very worthwhile to utilize (and contribute to) the immense amount of research and work in the field of computational mathematics. A similar statement could be made for areas of interest outside of mathematics.
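As a small, hedged example of leaning on these libraries rather than hand-rolled loops, a matrix multiply can be delegated to BLAS through the CBLAS interface (header names and link flags differ between OpenBLAS and MKL; the matrices here are made up):

```cpp
#include <cblas.h>   // CBLAS interface (OpenBLAS; MKL exposes the same calls via mkl.h)
#include <vector>
#include <cstdio>

int main() {
    // C = 1.0 * A * B + 0.0 * C, with A a 2x3 and B a 3x2 matrix (row-major).
    std::vector<double> A = {1, 2, 3,
                             4, 5, 6};
    std::vector<double> B = { 7,  8,
                              9, 10,
                             11, 12};
    std::vector<double> C(4, 0.0);

    // dgemm is typically multithreaded and vectorized inside the library,
    // so the caller gets parallelism without managing any threads.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,             // M, N, K
                1.0, A.data(), 3,    // alpha, A, lda
                B.data(), 2,         // B, ldb
                0.0, C.data(), 2);   // beta, C, ldc

    std::printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```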