Intel published a paper on SIMD-accelerating SHA-512 in November 2012.
They report ~8.59 cycles/byte for their AVX version on a Sandy Bridge i7-2600. They didn't publish results for their AVX2 / rorx
(BMI2) version, since Haswell hadn't been released yet. I didn't follow the links to the source code; presumably it's C with intrinsics.
To implement it in Ruby's source code, you'll need to handle the case where Ruby is running on a CPU that doesn't support the instruction-set extensions your fast version uses, and fall back to a plain C or SSE2-only version.
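A minimal sketch of that kind of runtime dispatch, assuming GCC or Clang on x86 (the `__builtin_cpu_supports` builtin is compiler-specific, and MSVC would need `__cpuid` instead). The `sha512_impl` enum and both function names are hypothetical, made up for illustration; they aren't from Intel's paper or from Ruby's source. It only picks which version to run; the actual SHA-512 kernels are left out.

```c
#include <stddef.h>

/* Hypothetical identifiers for the SHA-512 versions a build might ship. */
typedef enum {
    SHA512_IMPL_PORTABLE,   /* plain C (or SSE2-only) fallback */
    SHA512_IMPL_AVX,        /* AVX version, e.g. for Sandy Bridge */
    SHA512_IMPL_AVX2_BMI2   /* AVX2 + rorx (BMI2) version, e.g. for Haswell */
} sha512_impl;

/* Pick the fastest version the CPU we're actually running on supports.
 * On non-x86 targets or other compilers this always returns the fallback. */
static sha512_impl sha512_pick_impl(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();   /* make sure the feature bits are populated */
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi2"))
        return SHA512_IMPL_AVX2_BMI2;
    if (__builtin_cpu_supports("avx"))
        return SHA512_IMPL_AVX;
#endif
    return SHA512_IMPL_PORTABLE;
}

/* Human-readable name, handy for logging which path was selected. */
static const char *sha512_impl_name(sha512_impl impl)
{
    switch (impl) {
    case SHA512_IMPL_AVX2_BMI2: return "avx2+bmi2";
    case SHA512_IMPL_AVX:       return "avx";
    default:                    return "portable";
    }
}
```

You'd normally call `sha512_pick_impl()` once at startup and cache the result (or a function pointer chosen from it), rather than re-checking CPU features on every hash call.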
Your best bet might be to have Ruby use OpenSSL or a similar library to get hand-tuned versions of SHA-512 and many other functions. Crypto libraries already have hand-tuned asm versions for many different platforms.
With Goldmont, Intel introduced new instructions (the SHA extensions) to accelerate SHA-1 and SHA-256; Skylake doesn't have them. Unfortunately, I don't see anything about being able to use those instructions for SHA-512.