The problem can be described as follow.
Input
__m256d a, b, c, d
Output
__m256d s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3],
c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
Work I have done so far
It seemed easy enough: two VHADD with some shuffling in-between but in fact combining all permutations featured by AVX can't generate the very permutation needed to achieve that goal. Let me explain:
VHADD x, a, b => x = {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
VHADD y, c, d => y = {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
Were I able to permute x and y in the same manner to get
x1 = {a[0]+a[1], a[2]+a[3], c[0]+c[1], c[2]+c[3]}
y1 = {b[0]+b[1], b[2]+b[3], d[0]+d[1], d[2]+d[3]}
then
VHADD s, x1, y1 => s1 = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3],
c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
which is the result I wanted.
Thus I just need to find how to perform
x,y => {x[0], x[2], y[0], y[2]}, {x[1], x[3], y[1], y[3]}
Unfortunately I came to the conclusion that this is provably impossible using any combination of VSHUFPD, VBLENDPD, VPERMILPD, VPERM2F128, VUNPCKHPD, VUNPCKLPD. The crux of the matter is that it is impossible to swap u[1] and u[2] in an instance u of __m256d.
Question
Is this really a dead end? Or have I missed a permutation instruction?