If you want to generate unsigned long long primes using a version of get_primes() then you are in for a very long wait.
For generating primes in the range lo ... hi (inclusive) you need to consider only factors up to sqrt(hi). Hence you need a small sieve (up to 32 bits) for factors and another small sieve of size (hi - lo + 1) for sieving the target range.
Here's an abbreviated version of a sieve that runs up to 2^64 - 1; it uses a full sieve instead of sieving only odd numbers, because it is the reference code I use to verify optimised implementations. The changes for that are straightforward but add even more pitfalls to the whole shebang. As it is it sieves the 10 million primes between 999560010209 and 999836351599 in about 3 seconds, and those between 18446744073265777349u and 18446744073709551557u (i.e. just below 2^64) in about 20 seconds
The factor sieve is global because it gets reused a lot, and sieving the factors can take a while too. I.e. prepping the factors for a range close to 2^64 means sieving all (or most) of the factors up to 2^32 - 1, and thus it can easily take up to 10 seconds.
I wrapped my bitmap code (moral equivalent to std::bitset<>) and the factor sieve into classes; using raw vectors would make the code inflexible and unreadable. I shortened my code, remove a lot of asserts and other noise, and substituted calls to external functions with inlined code (like the call to std::sqrt()), for the sake of exposition. That way you can cull answers like what to do with the offset (here called lo) directly from verified working code.
The point of having separate number_t and index_t is that number_t can be unsigned long long but index_t must be uint32_t for my current infrastructure. The member functions of bitmap_t use the name of the underlying CPU instructions. BTS ... bit test and set, BT ... bit test. Bitmaps are initialised to 0 and a set bit signifies non-prime.
typedef uint32_t index_t;
sieve_t g_factor_sieve;
template<typename number_t, typename OutputIterator>
index_t generate_primes (number_t lo, number_t hi, OutputIterator sink)
{
// ...
index_t max_factor = index_t(std::sqrt(double(hi)));
g_factor_sieve.extend_to_cover(max_factor);
number_t range = hi - lo; assert( range <= index_t(index_t(0) - 1) );
index_t range32 = index_t(range);
bitmap_t bm(range32);
if (lo < 2) bm.bts(1 - index_t(lo)); // 1 is not a prime
if (lo == 0) bm.bts(0); // 0 is not a prime
for (index_t n = 2; n <= max_factor && n > 1; n += 1 + (n & 1))
{
if (g_factor_sieve.not_prime(n)) continue;
number_t start = square(number_t(n));
index_t stride = n << (int(n) > 2 ? 1 : 0); // double stride for n > 2
if (start >= lo)
start -= lo;
else
start = (stride - (lo - start) % stride) % stride;
// double test because of the possibility of wrapping
for (index_t i = index_t(start); i <= bm.max_bit; )
{
bm.bts(i);
if ((i += stride) < stride)
{
break;
}
}
}
// output
for (index_t i = 0; ; ++i)
{
if (!bm.bt(i))
{
*sink = lo + i;
++sink;
++n;
}
if (i >= bm.max_bit) break;
}
return n;
}