On the whole, the base strategy for picking a pivot should be to pick a random element. The cost of this strategy depends, of course, on the cost of generating random numbers (one per partitioning step, so O(N) of them in total), but unlike any strategy that picks pivots from fixed positions, the stochastic strategy is not open to DoS attacks (provided that the random number generator is not open to prediction attacks).
Always choosing the first (or last) element of the partition is not only open to DoS attacks; it also produces worst-case O(N²) behaviour on already-sorted data. That's certainly not acceptable in any normal environment where sorted or partially-sorted inputs can be expected, so it should be rejected out of hand.
Median-of-3 (or Median-of-k) strategies where the objects whose median is computed are at fixed positions are subject to DoS attacks. In environments where this is possible, the strategy can be improved by randomly selecting one or more values to participate in the median computation. I believe (although I don't propose to provide any proof of the conjecture) that it is sufficient to select the "middle" value at random, although there might be some tiny benefit to selecting all three values at random. The additional cost is potentially significant, though: two (more) random numbers, and two additional swaps. (Substitute k−1 for 2 if you are using median-of-k.)
The following comparison is based on the above assumptions -- that we will reject the strategy of taking the pivot to be the first element, and that we can do median-of-3 with the median of the first, last and a single randomly-selected element. Since both of these choose one random number per partition, I'll ignore the cost of generating random numbers. I will also use n consistently as the size of the partition, and assume that it is, at least, 4 so that there is no issue with finding three elements for the median. (Generally, real-world QS implementations use a different sort algorithm for small partitions; I think it is fair to assume that for partitions of sizes less than some small number, there are hard-coded sorting functions.)
In the stochastic-pivot strategy, we proceed as follows (a sketch in code follows the list):

1. Generate a random index r in the half-open range [0, n).
2. Swap elements 0 and r. Element 0 will be the pivot value.
3. Partition the remaining n−1 elements, which involves comparing every one of them with the pivot value once, for a total of n−1 comparisons. The number of swaps is harder to estimate (and varies with the partition algorithm) but it is certainly O(n).
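As a concrete illustration, here is a minimal sketch of that procedure. The code and the name `stochastic_partition` are my own, assuming `int` elements and a Lomuto-style partition; a production implementation would be generic and might well prefer Hoare's scheme:

```cpp
#include <cstddef>
#include <random>
#include <utility>

// Choose a uniformly random pivot and partition v[0..n-1] around it.
// Returns the pivot's final index. Costs n - 1 comparisons and O(n) swaps.
std::size_t stochastic_partition(int* v, std::size_t n, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> dist(0, n - 1);
    std::swap(v[0], v[dist(rng)]);       // steps 1-2: pivot value now at v[0]

    const int pivot = v[0];
    std::size_t store = 0;               // last index known to hold a value < pivot
    for (std::size_t i = 1; i < n; ++i)  // step 3: n - 1 comparisons
        if (v[i] < pivot)
            std::swap(v[++store], v[i]);
    std::swap(v[0], v[store]);           // put the pivot between the two halves
    return store;
}
```

Recursing on the two sides (and falling back to a hard-coded sort below some small size, as noted above) gives the full quicksort.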
In the stochastic-median-of-three strategy, we first arrange for the elements at positions 0, 1 and n−1 to be sorted in order, with the median of the three sampled values at position 1. So the procedure will be (a sketch in code follows the list):

1. Generate a random index r in the half-open range [1, n−1).
2. Compare the elements at locations 0 and n−1; swap them if necessary to put them in order. (1 compare; worst-case 1 swap.)
3. Compare element 0 with element r. If element r is smaller, rotate elements 0, 1 and r. Otherwise, compare element r with element n−1, and if element r is bigger, rotate elements n−1, 1 and r. Otherwise element r is itself the median, so swap elements 1 and r (a no-op if r happens to be 1). (Worst case: 2 compares and 1 rotate.) The pivot value is now the value at position 1.
4. Use the same partitioning algorithm, but this time on n−3 elements, since elements 0 and n−1 are already on the correct sides of the pivot. This involves n−3 comparisons and O(n) swaps.
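Here is the corresponding sketch for the selection step (again my own code and names; note the ordering of the assignments in the rotates, which lets the r == 1 case degenerate harmlessly into a swap):

```cpp
#include <cstddef>
#include <random>
#include <utility>

// Place the median of v[0], v[r] and v[n-1] at position 1, where r is
// uniform on [1, n-1). Assumes n >= 4. Afterwards v[0] <= v[1] <= v[n-1]
// and v[1] is the pivot value.
void stochastic_median_of_3(int* v, std::size_t n, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> dist(1, n - 2);
    const std::size_t r = dist(rng);                 // step 1

    if (v[n - 1] < v[0]) std::swap(v[0], v[n - 1]);  // step 2: 1 compare, <=1 swap

    if (v[r] < v[0]) {
        // v[r] is the minimum: the value at r goes to 0, 0 goes to 1, 1 goes
        // to r, leaving the median (old v[0]) at position 1.
        int t = v[r]; v[r] = v[1]; v[1] = v[0]; v[0] = t;
    } else if (v[n - 1] < v[r]) {
        // v[r] is the maximum: the value at r goes to n-1, n-1 goes to 1,
        // 1 goes to r, leaving the median (old v[n-1]) at position 1.
        int t = v[r]; v[r] = v[1]; v[1] = v[n - 1]; v[n - 1] = t;
    } else {
        // v[r] is already the median: move it to position 1 (no-op if r == 1).
        std::swap(v[1], v[r]);
    }
}
```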
Now, to compare the two strategies.
The cost of step 1 is identical in the two cases, since the cost of generating a random number does not depend on the range of possibilities (at least, given these two possible ranges).
The cost of step 2 in the first strategy is zero compares and one swap. The cost of steps 2 and 3 combined in the second strategy is two or three compares (expected 8/3), zero, one or two swaps (expected 5/6, ignoring the negligible case r = 1) and zero or one rotates (expected 2/3). The difference in worst cases is three compares and one rotate; the difference in expected cost is 8/3 compares and 2/3 of a rotate less 1/6 of a swap.
(Parenthetically, a swap is three mutations -- t = a; a = b; b = t -- while a rotate is four mutations -- t = a; a = b; b = c; c = t. If we approximate a rotate as 4/3 of a swap, the worst-case difference is 4/3 of a swap, and the expected difference is 2/3 × 4/3 − 1/6 = 13/18 of a swap.)
The difference between step 3 of the first strategy and step 4 of the second strategy is in the opposite direction: stochastic median-of-3 uses 2 fewer compares and O(1) fewer swaps.
So, adding all that up, we can estimate the net per-partition cost of stochastic median-of-3 to be at worst one extra compare, and on average 2/3 of an extra compare. (I'm not entirely sure how to compare the swap counts, but I think stochastic median-of-3 comes out ahead in the worst case and roughly breaks even in the average case.)
Now, the question is: does stochastic median-of-3 provide enough benefit to outweigh 2/3 of a compare per partition? It certainly wouldn't take much benefit to justify that cost, but I'm a little sceptical that it provides even that much.
If we were to use median-of-5 instead of median-of-3, the cost/benefit ratio would be even less favourable. Median-of-5 saves an additional two compares in the partitioning phase (since the remaining subarray now has n−5 elements), but the cost of computing the median is considerably higher. It is not as high as suggested by the OP, however: it is possible to fully sort five elements with worst-case seven comparisons (the expected number is only very slightly smaller than seven) using a well-known algorithm described in this SO question. But even at a cost of seven (instead of the average 11.17/worst-case 17 suggested in the OP), it is still quite a bit larger than the total saving of four comparisons in the partitioning phase.
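For concreteness, here is one way to achieve that bound: a sketch of merge insertion (the Ford-Johnson algorithm) specialised to five elements. This is my own rendering, not code from the linked question; the worst case of seven comparisons matches the information-theoretic lower bound ⌈log2(5!)⌉ = 7.

```cpp
#include <array>
#include <utility>

// Sort five values with at most 7 comparisons, marked #1..#7.
std::array<int, 5> sort5(std::array<int, 5> in) {
    auto [a, b, c, d, e] = in;
    if (b < a) std::swap(a, b);                      // #1: a <= b
    if (d < c) std::swap(c, d);                      // #2: c <= d
    if (d < b) { std::swap(a, c); std::swap(b, d); } // #3: a <= b <= d, c <= d

    // Binary-insert e into the sorted chain a <= b <= d (two comparisons).
    int chain[4];
    int dpos;                                        // index of d within chain
    if (e < b) {                                     // #4
        if (e < a) { chain[0] = e; chain[1] = a; }   // #5
        else       { chain[0] = a; chain[1] = e; }
        chain[2] = b; chain[3] = d; dpos = 3;
    } else if (e < d) {                              // #4, #5
        chain[0] = a; chain[1] = b; chain[2] = e; chain[3] = d; dpos = 3;
    } else {
        chain[0] = a; chain[1] = b; chain[2] = d; chain[3] = e; dpos = 2;
    }

    // Insert c, known to satisfy c <= d, below d's slot (at most two more).
    int cpos;
    if (c < chain[1])       cpos = (c < chain[0]) ? 0 : 1;  // #6, #7
    else if (dpos == 2)     cpos = 2;                // #6; only slot below d left
    else                    cpos = (c < chain[2]) ? 2 : 3;  // #6, #7

    std::array<int, 5> out;
    int j = 0;
    for (int i = 0; i < 4; ++i) {
        if (i == cpos) out[j++] = c;
        out[j++] = chain[i];
    }
    return out;
}
```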
So I conclude that stochastic median-of-3 might be slightly better than a plain stochastic pivot, though I suspect not enough better to justify the additional code complexity; stochastic median-of-5 definitely is not.