
Why does the chisq.test function in R sort data in descending order before summation?

The code in question is:

STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))

If I were worried about numerical stability due to floating-point arithmetic and wanted an easy-to-deploy fix, I would sort the data in increasing order before summation, so that a tiny value is never added to an already large accumulator (this avoids trimming the least significant bits of the result as much as possible).

I looked into the source code of sum, but it did not explain why the data is passed to sum() in descending order. What am I missing?

An example:

x = matrix(1.1, 10001, 1)
x[1] = 10^16   # We have a vector with 10000*1.1 and 1*10^16
c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))

The result:

10000000000010996 10000000000011000

When we sort the data in ascending order, we get the correct result. If we sort the data in descending order, we get a result that is off by 4.
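
To see the bit trimming in isolation, here is a minimal two-operand sketch (assuming IEEE 754 doubles, where the spacing between adjacent representable numbers near 10^16 is 2; note that sum() may accumulate in extended precision internally, so it can do slightly better than this model):

## An addend below half the spacing is rounded away entirely:
10^16 + 0.6 == 10^16              # TRUE: the 0.6 is lost
## Summed first, the small values survive the final addition:
(0.6 + 0.6) + 10^16 == 10^16 + 2  # TRUE: together they reach the next double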

user824276
  • I found your question quite interesting, so I dug a bit to find some references. It seems that summation in decreasing order can yield better precision in cases where there is heavy cancellation in the sum, i.e. the absolute value of the sum << the sum of absolute values. See the book [Accuracy and stability of numerical algorithms](http://ftp.demec.ufpr.br/CFD/bibliografia/Higham_2002_Accuracy%20and%20Stability%20of%20Numerical%20Algorithms.pdf), pp. 82-83. But that is not the case here, as all the elements being summed are positive; therefore, a sum in increasing order should be more appropriate. – Lamia Dec 23 '17 at 23:29
  • @Lamia; nice link. It seems the bottom of page 82 mentions that if the numbers include a *"floating point number so large"*, decreasing order might be best. Perhaps the decision was made in case the terms of (x - E)^2/E are large? – user2957945 Dec 24 '17 at 12:33
  • @user2957945 I don't think that's the case, as the example you refer to, with a very large floating point number M, is the sum of {1,M,2M,-3M}, which is an example of heavy cancellation, whereas here all the numbers are positive. They also mention that for the summation of non-negative numbers, increasing order has better accuracy (p. 82 and Table 4.1, p. 89). So I'm not sure why there is decreasing order here... Maybe someone with more specific knowledge can weigh in? – Lamia Dec 24 '17 at 21:15
  • My best guess (and probably we'll never get to the bottom of this) is that this is a somewhat innocuous error caused by too many caffeine-fuelled late nights of coding. The intention may have been to use sort to avoid losing LSBs, but it was inadvertently done the wrong way around. And the error was so innocuous that no one noticed until now, when someone inquisitive and astute started trying to follow the source code. – dww Dec 28 '17 at 15:57
  • My guess is that R is a weakly typed pseudo language. – wildplasser Dec 29 '17 at 00:26

1 Answer


EDIT: The book "Accuracy and Stability of Numerical Algorithms" by Nicholas J. Higham states that

"when summing nonnegative numbers by recursive summation the increasing ordering is the best ordering, in the sense of having the smallest a priori forward error bound."

Thanks @Lamia for sharing the book in the comment section.

The book explains three methods of summation: the recursive, insertion, and pairwise techniques. Each technique has its own merits and demerits, measured by the magnitude of the error bound associated with it, which can be derived through a systematic error analysis of floating-point summation.

Notably, the result of the recursive technique depends on the ordering strategy: increasing, decreasing, or Psum (see page 82, 4th paragraph of the book, and the worked example at the bottom of page 82).
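
For illustration, here is a minimal sketch of recursive (left-to-right) summation as a plain double-precision loop. This is a reference model of the book's technique, not necessarily what base R's sum() does internally (it may accumulate in extended precision):

## recursive (left-to-right) summation in pure double precision
recursive_sum <- function(v) {
  acc <- 0
  for (xi in v) acc <- acc + xi
  acc
}

v <- c(10^16, rep(1.1, 10000))
format(c(increasing = recursive_sum(sort(v)),
         decreasing = recursive_sum(sort(v, decreasing = TRUE))),
       digits = 22)

In strict 53-bit arithmetic the two orderings should reproduce the x_fp_asc and x_fp_desc values in the table below.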

Looking at the R source code for the sum() function, which can be found in summary.c, shows that R implements the recursive method in its sum() function.

Also, the number of base-2 digits in the floating-point significand is 53, which can be obtained from

.Machine$double.digits
# [1] 53

By using this number as the precision in bits, we can compare the accuracy of base R's sum operation with mpfr() from the Rmpfr library under different ordering strategies. Notice that increasing order produces results closer to those of the precision-aware summations, which corroborates the statement quoted above from the book.

I computed the chi-square statistic using the raw data x:

library('data.table')
library('Rmpfr')

## 10000 copies of 1.1 plus a single 10^16
x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)
df1 <- data.frame(x = x1)
setDT(df1)
df1[, p := rep(1/length(x), length(x))]  # equal expected probabilities
s_x <- df1[, sum(x)]
df1[, E := s_x * p]                      # expected counts under p
df1[, chi := ((x - E)^2/E)]              # per-cell chi-square terms

precBits <- .Machine$double.digits       # 53 bits, mimicking doubles
x_chi <- data.frame( names = c("x_asc", "x_desc", "x_fp_asc", "x_fp_desc",
                               "chi_asc", "chi_desc", "chi_fp_asc", "chi_fp_desc"))
x_chi$vals <- c( ## x
  df1[order(x), format( sum(x), digits = 22)],
  df1[order(-x), format( sum(x), digits = 22)],
  df1[order(x), format( sum(mpfr(x, precBits = precBits)), digits = 22)],
  df1[order(-x), format( sum(mpfr(x, precBits = precBits)), digits = 22)],
  ## chi
  df1[order(chi), format( sum(chi), digits = 22)],
  df1[order(-chi), format( sum(chi), digits = 22)],
  df1[order(chi), format( sum(mpfr(chi, precBits = precBits)), digits = 22)],
  df1[order(-chi), format( sum(mpfr(chi, precBits = precBits)), digits = 22)])

x_chi
#         names                    vals
# 1       x_asc       10000000000011000
# 2      x_desc       10000000000010996
# 3    x_fp_asc 10000000000011000.00000
# 4   x_fp_desc 10000000000020000.00000
# 5     chi_asc    99999999999890014218
# 6    chi_desc    99999999999890030592
# 7  chi_fp_asc 99999999999890014208.00
# 8 chi_fp_desc 99999999999833554944.00

Looking at the source code via edit(chisq.test) shows that no sorting operation is involved in the code path exercised here: the decreasingly sorted sum (source line 57) belongs to the Pearson chi-square branch for contingency tables, while a one-column matrix is converted to a vector and handled by the goodness-of-fit branch (source line 94), which sums without sorting.
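
One way to verify this without reading the whole function is to search the deparsed source for the summation statements (a sketch; the grep pattern assumes the statements still look as they do in the version discussed here):

## list every statistic-summing statement in the installed version;
## the line numbers cited in the comments may differ across R versions
src <- deparse(body(stats::chisq.test))
grep("STATISTIC <- sum", src, value = TRUE)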

Also, as pointed out in the comment section, the issue is not related to the sign (positive or negative) of the raw data passed to chisq.test(): the function does not accept negative values and halts with the error message "all entries of 'x' must be nonnegative and finite", so the heavy-cancellation case from the book cannot arise here.

set.seed(2L)
chisq.test(c(rnorm(10, 0, 1)))
# Error in chisq.test(c(rnorm(10, 0, 1))) : 
#   all entries of 'x' must be nonnegative and finite

The differences in the values when summing floating-point numbers come from double-precision arithmetic. See the demo below. When the numbers are kept at 200 bits of precision using the mpfr() function from the Rmpfr package, the sum operation gives identical results regardless of the order of the vector x1 or x2. However, unequal values are observed with plain double precision.

Without extended precision (plain doubles):

x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)
## reverse
x2 = matrix(c( rep(1.1, 10000), 10^16 ), 
            nrow = 10001, ncol = 1)

c( format(sum(x1), digits = 22), 
   format(sum(x2), digits = 22))
# [1] "10000000000010996" "10000000000011000"

With 200-bit precision maintained:

library('Rmpfr')
##
prec <- 200
x1 = matrix(c( mpfr( 10^16, precBits = prec),
              rep( mpfr(1.1, precBits = prec), 10000)), 
           nrow = 10001, ncol = 1)

## reverse
x2 = matrix(c( rep(mpfr(1.1, precBits = prec), 10000), 
              mpfr( 10^16, precBits = prec) ), 
           nrow = 10001, ncol = 1)
c( sum(x1), sum(x2))
# 2 'mpfr' numbers of precision  200   bits 
# [1] 10000000000011000.000000000000888178419700125232338905334472656
# [2] 10000000000011000.000000000000888178419700125232338905334472656

The machine epsilon in base R, i.e. the smallest positive double x for which 1 + x != 1, can be obtained from the code below. Any value whose relative contribution to an accumulator falls below this threshold is rounded away, which is what produces the varying results of the sum operation.

.Machine$double.eps
# [1] 2.220446e-16
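
A quick check that this is a relative threshold rather than a smallest representable number (a minimal sketch relying only on round-to-nearest):

## eps is the spacing of doubles just above 1
(1 + .Machine$double.eps)   != 1   # TRUE: smallest x with 1 + x != 1
(1 + .Machine$double.eps/2) != 1   # FALSE: the contribution is rounded away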

Comparison of precision-aware and precision-unaware versions of the chisq.test() computation.

The relevant portion of chisq.test() is extracted into a new function, chisq.test2(). Inside, you will see the chi-square statistic computed both before and after converting the data to 250-bit precision with the mpfr() function, in each ordering. The precision-aware results are nearly identical regardless of order, unlike the raw double-precision results.

# modified chi square function:
chisq.test2 <- function (x, precBits) 
{
  if (is.matrix(x)) {
    if (min(dim(x)) == 1L) 
      x <- as.vector(x)
  }

  # plain double precision
  p = rep(1/length(x), length(x))
  n <- sum(x)
  E <- n * p

  # 250-bit precision via mpfr
  x1 <- mpfr(x, precBits = precBits)
  p1 = rep(1/length(x1), length(x1))
  n1 <- sum(x1)
  E1 <- n1 * p1

  # chi-square statistic, with and without extended precision, in each order
  STATISTIC <- c(format(sum((x - E)^2/E), digits=22),           # double, original (decreasing) order
                 format(sum(sort((x - E)^2/E, decreasing = FALSE)), digits=22), # double, increasing order
                 sum((x1 - E1)^2/E1),                           # mpfr, original (decreasing) order
                 sum(sort((x1 - E1)^2/E1, decreasing = FALSE))) # mpfr, increasing order

  return(STATISTIC)
}

# data
x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)

chisq.test2(x = x1, precBits=250)

Output:

# [[1]]  # double, decreasing order
# [1] "99999999999890030592"
# 
# [[2]]  # double, increasing order
# [1] "99999999999890014218"
# 
# [[3]]  # mpfr, decreasing order
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230046685
# 
# [[4]]  # mpfr, increasing order
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230251906
Sathish
  • do you know why the line `STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))` (line 57) is sorted decreasingly? thanks – user2957945 Dec 27 '17 at 14:32
  • Am I missing something, or does this not really answer the question? – dww Dec 27 '17 at 18:48
  • why downvote this answer? The OP who asked the question used matrix data of one column. See the source code of the `chisq.test()` function: it never uses line 57 of the source, which @user2957945 refers to in the comment above. Instead, the data given in the question goes through line 94. – Sathish Dec 27 '17 at 23:23
  • one more point: sorting the data in decreasing order results in a less precise value than sorting it in increasing order. Note that the value `10000000000011000`, obtained without extended precision but with increasing order, is close to the values obtained from the sums of the precision-aware data. – Sathish Dec 27 '17 at 23:52
  • @user2957945 I don't know why the developer of this function resorted to the less precise method; I don't know how to guess. If you look at the code block at line 57 (Pearson chi-square) and line 94 (chi-square for given probabilities), the difference in the approach is the way the p-value is computed, which uses the statistic to see whether it would be unusual under the chi-square distribution. – Sathish Dec 28 '17 at 00:06