I would like to understand why sum/min/max functions in R interpret a character string as TRUE when supplied to na.rm, while mean() does not.

My uneducated guess is that as.logical("xyz") returns NA, which is then supplied to na.rm, and for some strange reason sum/min/max accept it as TRUE while mean() does not.

The output I expected from sum(c(NA, 4, 5), na.rm = "xyz") is the "argument is not interpretable as logical" error that mean() returns. I don't understand why that isn't the case.

Axeman
Plhu
  • 2
    It is not a coincidence that `min/max/sum` are primitives while `mean` is not. The processing of `if (na.rm)` produces an error in `mean.default`, and I assume it does not in `min/max/sum` due to their being primitives. – Rich Scriven May 22 '19 at 00:16
  • This QA is very similar, and points in the right direction of examining the C source code: https://stackoverflow.com/a/14035586/ – John Colby May 22 '19 at 00:19
  • e.g. https://github.com/wch/r-source/blob/1abc6ab6842af405b4b51da4a3422fd5a5153f9a/src/main/summary.c#L442-L448 – John Colby May 22 '19 at 00:20
  • I agree that it would be useful if `na.rm` were evaluated & coerced consistently across the board. Note that `na.rm="FALSE"` is indeed parsed as a logical, so it's not that any string becomes TRUE, cf. `sum(c(1:3,NA), na.rm="xyz") == 6`, `sum(c(1:3,NA), na.rm="TRUE") == 6`, and `sum(c(1:3,NA), na.rm="FALSE") == NA`. – HenrikB May 22 '19 at 15:37
  • Agreed! I don't understand the need for inconsistency here. I am not familiar with C, but I would assume some sort of strict type check should be simple to implement and would enforce consistent behavior across the board. Was definitely a [WAT!?](https://www.destroyallsoftware.com/talks/wat) moment for me. – Plhu May 22 '19 at 21:38

1 Answer

As far as mean is concerned, it is quite straightforward. As @Rich Scriven mentions, if you type mean.default in the console you will see this section of code:

if (na.rm) 
   x <- x[!is.na(x)]

which gives you the error.

mean(1:10, na.rm = "abc") #gives

Error in if (na.rm) x <- x[!is.na(x)] : argument is not interpretable as logical

which is similar to doing

if ("abc") "Hello"

Error in if ("abc") "Hello" : argument is not interpretable as logical


Now, regarding sum, min, max and other primitive functions, which are implemented in C: the source code of these functions is here. A parameter Rboolean narm is passed into the function.

C treats booleans differently from R.

#include <stdio.h>
#include <stdbool.h>

int main()
{
  /* A string literal is a non-null pointer, so it converts to true. */
  bool a = "abc";
  if (a)
    printf("Hello World");
  else
    printf("Not Hello World");
  return 0;
}

If you run the above C code, it prints "Hello World". Run the demo here. In C, a string literal assigned to a boolean converts to true because it is a non-null pointer. The same holds for any nonzero number, so

sum(1:10, na.rm = 12)

works as well.
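The same truthiness rule also seems to fit the original "xyz" case. As HenrikB's comment notes, na.rm = "FALSE" is still read as FALSE, so the string appears to be coerced to a logical first, with an unrecognised string becoming NA. R stores a logical NA as the integer INT_MIN, which is nonzero, so a plain C truthiness test would treat it like TRUE. The sketch below only illustrates that idea under those assumptions. It is not R's actual source, and the names NA_LOGICAL_SKETCH, NA_INT_SKETCH, sum_lenient and sum_strict are made up for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for R's logical NA, which R stores as INT_MIN. */
#define NA_LOGICAL_SKETCH INT_MIN
/* Hypothetical marker for a missing element in this sketch's data. */
#define NA_INT_SKETCH INT_MIN

/* Sums x[0..n-1]. A missing element is skipped when narm is nonzero;
 * otherwise the result is missing. narm is only tested for truthiness,
 * so 1, 12 and NA_LOGICAL_SKETCH all behave like TRUE. */
long sum_lenient(const int *x, size_t n, int narm)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] != NA_INT_SKETCH)
            total += x[i];
        else if (!narm)
            return NA_INT_SKETCH;  /* propagate the missing value */
    }
    return total;
}

/* Strict variant: refuses a missing flag, much as R's if () does.
 * Returns 0 on success and -1 when narm is NA. */
int sum_strict(const int *x, size_t n, int narm, long *out)
{
    if (narm == NA_LOGICAL_SKETCH)
        return -1;
    *out = sum_lenient(x, n, narm);
    return 0;
}
```

With x = {NA_INT_SKETCH, 4, 5}, sum_lenient returns 9 for a flag of 1, 12 or NA_LOGICAL_SKETCH, while sum_strict returns -1 for the NA flag, which is closer to how mean() behaves.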

PS - I am no expert in C and know only a little R. Finding all these insights took a lot of time. Let me know if I have misinterpreted something or provided any false information.

Ronak Shah
  • 1
    Thanks! I guess character strings and numbers are considered truthy in C but it still perturbs me that the implementation is not consistent with R's rules. I wonder if there is a reason why these primitives haven't been refactored for consistency (with some sort of type check in C). – Plhu May 22 '19 at 15:28
  • 1
    @Puzhu I agree. It would have been much better if these functions showed consistent behavior irrespective of their underlying implementation. – Ronak Shah May 22 '19 at 23:17