Here's a way of stretching a 16-bit mask into 64 bits where every bit represents 4 bits of stretched mask:
uint64_t x = 0x000000000000CF00LL;
x = (x | (x << 24)) & 0x000000ff000000ffLL;
x = (x | (x << 12)) & 0x000f000f000f000fLL;
x = (x | (x << 6)) & 0x0303030303030303LL;
x = (x | (x << 3)) & 0x1111111111111111LL;
x |= x << 1;
x |= x << 2;
It starts of with the mask in the bottom 16 bits. Then it moves the top 8 bits of the mask into the top 32 bits, like this:
0000000000000000 0000000000000000 0000000000000000 ABCDEFGHIJKLMNOP
becomes
0000000000000000 00000000ABCDEFGH 0000000000000000 00000000IJKLMNOP
Then it solves the similar problem of stretching a mask from the bottom 8 bits of a 32 bit word, to the top and bottom 32-bits simultaneously:
000000000000ABCD 000000000000EFGH 000000000000IJKL 000000000000MNOP
Then it does it for 4 bits inside 16 and so on until the bits are spread out:
000A000B000C000D 000E000F000G000H 000I000J000K000L 000M000N000O000P
Then it "smears" them across 4 bits by ORing the result with itself twice:
AAAABBBBCCCCDDDD EEEEFFFFGGGGHHHH IIIIJJJJKKKKLLLL MMMMNNNNOOOOPPPP
You could extend this to 128 bits by adding an extra first step where you shift by 48 bits and mask with a 128-bit constant:
x = (x | (x << 48)) & 0x000000000000ffff000000000000ffffLLL;
You'd also have to stretch the other constants out to 128 bits just by repeating the bit patterns. However (as far as I know) there is no way to declare a 128-bit constant in C++, but perhaps you could do it with macros or something (see this question). You could also make a 128-bit version just by using the 64-bit version on the top and bottom 16 bits separately.
If loading the masking constants turns out to be a difficulty or bottleneck you can generate each one from the previous one using shifting and masking:
uint64_t m = 0x000000ff000000ffLL;
m &= m >> 4; m |= m << 16; // gives 0x000f000f000f000fLL
m &= m >> 2; m |= m << 8; // gives 0x0303030303030303LL
m &= m >> 1; m |= m << 4; // gives 0x1111111111111111LL