This doesn't directly answer your question, since you are using doubles, but here is the code I used to get the maximum of 8 single-precision values. It builds on the answers by @celion and @Norbert P.
/* Note: the 0b... literals with ' digit separators require C++14 or later. */
/* The do { } while (0) wrapper keeps v1..v7 out of the caller's scope, so the macro can be used more than once per scope. */
#define HORIZONTAL_MAX_256(ymmA, result) \
do { \
/* [upper | lower] */ \
/* [7 6 5 4 | 3 2 1 0] */ \
__m256 v1 = (ymmA); /* v1 = [H G F E | D C B A] */ \
__m256 v2 = _mm256_permute_ps(v1, 0b10'11'00'01); /* v2 = [G H E F | C D A B] */ \
__m256 v3 = _mm256_max_ps(v1, v2); /* v3 = [W=max(G,H) W=max(G,H) Z=max(E,F) Z=max(E,F) | Y=max(C,D) Y=max(C,D) X=max(A,B) X=max(A,B)] */ \
/* v3 = [W W Z Z | Y Y X X] */ \
__m256 v4 = _mm256_permute_ps(v3, 0b00'00'10'10); /* v4 = [Z Z W W | X X Y Y] */ \
__m256 v5 = _mm256_max_ps(v3, v4); /* v5 = [J=max(Z,W) J=max(Z,W) J=max(Z,W) J=max(Z,W) | I=max(X,Y) I=max(X,Y) I=max(X,Y) I=max(X,Y)] */ \
/* v5 = [J J J J | I I I I] */ \
__m128 v6 = _mm256_extractf128_ps(v5, 1); /* v6 = [- - - - | J J J J] */ \
__m128 v7 = _mm_max_ps(_mm256_castps256_ps128(v5), v6); /* v7 = [- - - - | M=max(I,J) M=max(I,J) M=max(I,J) M=max(I,J)] */ \
/* v7 = [- - - - | M M M M] */ \
/* M = max(I,J) */ \
/* M = max(max(X,Y),max(Z,W)) */ \
/* M = max(max(max(A,B),max(C,D)),max(max(E,F),max(G,H))) */ \
_mm_store_ss(&(result), v7); \
} while (0)
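Since the question is about doubles, the same swap-and-max pattern can probably be adapted to the four doubles in a `__m256d`. What follows is only a minimal sketch under some assumptions: `horizontal_max_4d` is a hypothetical name, it takes a pointer so it can be called from code built without AVX flags, and `__attribute__((target("avx")))` is a GCC/Clang extension (it can be dropped when compiling with `-mavx`, and would need replacing on other compilers).

```cpp
#include <immintrin.h>

// Hypothetical helper, not part of the macro above: max of the 4 doubles at p.
// The target attribute is a GCC/Clang extension (assumption about the compiler).
__attribute__((target("avx")))
double horizontal_max_4d(const double* p) {
    __m256d v  = _mm256_loadu_pd(p);            // v  = [D C | B A]
    __m256d sw = _mm256_permute_pd(v, 0x5);     // sw = [C D | A B] (swap within each lane)
    __m256d m1 = _mm256_max_pd(v, sw);          // m1 = [Y Y | X X], X=max(A,B), Y=max(C,D)
    __m128d hi = _mm256_extractf128_pd(m1, 1);  // hi = [Y Y]
    __m128d m2 = _mm_max_pd(_mm256_castpd256_pd128(m1), hi); // m2 = [M M], M=max(X,Y)
    return _mm_cvtsd_f64(m2);                   // lowest element holds M
}
```

With only four elements, one in-lane swap plus one cross-lane step is enough, so this is one permute/max pair shorter than the float version.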
Edit:
The VCL2 (Vector Class Library 2) library seems to produce assembly similar to what Peter Cordes describes in the comments. Here is the assembly that VCL2 generated for my project:
vextractf128 xmm1, ymm0, 1 # ymm0 is the register to find the max of
vmaxps xmm1, xmm0, xmm1
vpermilpd xmm2, xmm1, 3
vmaxps xmm1, xmm2, xmm1
vpsrldq xmm2, xmm1, 4
vmaxps xmm1, xmm2, xmm1
vbroadcastss ymm2, xmm1 # This is where the result would be stored; here it is broadcast instead, to set up the next instructions, which are specific to my code
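For reference, a similar "narrow to 128 bits first" reduction can be written directly with intrinsics. This is a rough sketch, not VCL2's actual source: `horizontal_max_8f` is a hypothetical name, the shuffles are chosen to do the same job as the `vpermilpd` / `vpsrldq` steps rather than to reproduce them exactly, and `__attribute__((target("avx")))` assumes GCC or Clang.

```cpp
#include <immintrin.h>

// Hypothetical helper: max of the 8 floats at p, reducing 256 -> 128 -> 64 -> 32 bits.
// The target attribute is a GCC/Clang extension (assumption about the compiler).
__attribute__((target("avx")))
float horizontal_max_8f(const float* p) {
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);        // lower 4 floats (no instruction needed)
    __m128 hi = _mm256_extractf128_ps(v, 1);      // upper 4 floats (vextractf128)
    __m128 m  = _mm_max_ps(lo, hi);               // 4 candidates left
    __m128 sh = _mm_movehl_ps(m, m);              // move elements 3,2 down to 1,0
    m  = _mm_max_ps(m, sh);                       // 2 candidates left, in elements 1,0
    sh = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)); // element 1 into element 0
    m  = _mm_max_ss(m, sh);                       // final max in element 0
    return _mm_cvtss_f32(m);
}
```

Doing the cross-lane extract first means every later step works on 128-bit registers, which is the main difference from the macro above, where the two in-lane steps run on the full 256-bit register.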