Which winsorize is more accurate, Python or R?

Question

I am trying to implement a winsorization function but am confused about its exact definition. The Winsorize function in the R package DescTools and the winsorize function in the Python library scipy.stats.mstats yield different results. I am a little surprised by this, as both functions are very popular, yet nobody seems to care about the difference. Here is a simple test:

In R

library(DescTools)
data <- seq(0, 99)
Winsorize(data, probs=c(0.025, 1-0.025))

The result is [2.475, 2.475, 2.475, 3., 4., 5., 6., ..., 96., 96.525, 96.525, 96.525].

However, in Python,

import numpy as np
from scipy.stats.mstats import winsorize

data = np.arange(100).astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
new_data = winsorize(data, [0.025, 0.025])
new_data

The result is [2., 2., 2., 3., 4., 5., 6., ..., 96., 97., 97., 97.].

What makes it even worse is that, based on Wikipedia's example, the result should be [3., 3., 3., 3., 4., 5., 6., ..., 96., 96., 96., 96.], because the 2.5th percentile is 2.475, which falls between 2 and 3, and therefore everything less than 2.475 should be rounded up to 3.

Does anybody know which version I should implement?

Thanks

I'm not familiar with `DescTools` or `Winsorize`. Does it make a difference that R's data here is integer and Python's data is real? — r2evans, Dec 05 '19 at 22:02
Thanks @r2evans, but this should not be the issue. It is `R` that generates a floating-point result. I believe R auto-converts the type when necessary. In fact, if we wrap the `seq` in `as.double()`, we still get the same result. — Bob, Dec 06 '19 at 13:39

IceCreamToucan · Accepted Answer · 2019-12-05T22:43:29.960

It seems to be a difference in how the quantile is defined. R uses a continuous quantile function by default, which is described in ?quantile's list of 9 types of quantiles under "Type 7". If you use type = 1 in DescTools::Winsorize, the results seem to match winsorize from scipy.stats.mstats (just based on the output shown in the question).

library(DescTools)
data <- seq(0, 99)
Winsorize(data, probs=c(0.025, 1-0.025), type = 1)
#   [1]  2  2  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
#  [34] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
#  [67] 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 97
# [100] 97

None of the 9 methods produces the output shown on the Wikipedia page for that example. There's no citation there, though, so I wouldn't put too much thought into it.
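To make the difference between the two quantile definitions concrete, here is a minimal pure-Python sketch. It assumes that winsorizing means clipping the data to a lower and an upper quantile, which is consistent with the R outputs shown above. The helper names (`quantile_type7`, `quantile_type1`, `winsorize_by_quantile`) are made up for this illustration and are not part of DescTools or SciPy. Note that scipy.stats.mstats.winsorize itself works by counting how many points to replace on each side rather than by calling a quantile function, so agreement with type 1 is only demonstrated for this example.

```python
import math


def quantile_type7(sorted_x, p):
    """Linear interpolation between order statistics (R's default, type 7)."""
    h = (len(sorted_x) - 1) * p          # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (h - lo) * (sorted_x[hi] - sorted_x[lo])


def quantile_type1(sorted_x, p):
    """Inverse of the empirical CDF (R's type 1): always an observed value."""
    k = max(math.ceil(len(sorted_x) * p), 1)   # 1-based rank
    return sorted_x[k - 1]


def winsorize_by_quantile(x, p_low, p_high, quantile):
    """Clip every value to the [p_low, p_high] quantile range."""
    s = sorted(x)
    lo, hi = quantile(s, p_low), quantile(s, p_high)
    return [min(max(v, lo), hi) for v in x]


data = [float(i) for i in range(100)]

# Type 7: the limits are interpolated, roughly 2.475 and 96.525.
w7 = winsorize_by_quantile(data, 0.025, 0.975, quantile_type7)

# Type 1: the limits are the observed values 2 and 97.
w1 = winsorize_by_quantile(data, 0.025, 0.975, quantile_type1)

print(w7[:4], w7[-4:])
print(w1[:4], w1[-4:])  # [2.0, 2.0, 2.0, 3.0] [96.0, 97.0, 97.0, 97.0]
```

With the type 7 limits the three smallest and three largest values move to interpolated points that are not in the data, as in the default DescTools output. With the type 1 limits they move to existing data points, as in the SciPy output from the question.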

Thanks, that makes sense. I did not realize that the definition of quantile is not unified. — Bob, Dec 06 '19 at 13:32
You mean that something on wiki is unverified and possibly incorrect? Incroyable! (I laugh a little when technical papers use something on wikipedia as a primary/sole reference on facts.) — r2evans, Dec 06 '19 at 13:47
