I'm using the rattle package to do some data cleaning and I'm considering the first variable X in my dataset. In the first tab, the "Data" tab, I get some basic information about the dataset, and it says that variable X has 1243 missing values. This is also the value that I get if I use sum(is.na(my_df[,1])).

On the next tab, the "Explore" tab, when I check "Summary" it now says that I have just 942 NAs in variable X.

How can I make sense of these different numbers? I manually browsed through my dataset and looked at some rows that had NAs, and those NAs all look the same (I understand that there are sometimes different types of NAs).

(Side question: sum(is.na(my_df[,1]), na.rm = FALSE) and sum(is.na(my_df[,1]),na.rm = TRUE) also both produce the same number 1243, why? I would have expected that one gives me length(my_df[,1])-1243.)


EDIT: Here is the dataset that has this problem: https://wetransfer.com/downloads/cf454b2c12857a4e3770102a7222422f20171019153755/516fb0

The numbers in that dataset are slightly different: instead of 1243, we have 88 NAs according to the "Data" tab in rattle() (or, equivalently, according to summary(ten_df)), and 62 NAs according to the "Explore" tab with "Summary" checked.

But now I suspect my dataset is broken. Before uploading the complete dataset, I originally wanted to upload only one illustrative column. But when I execute

ten_df = read.csv("ten.csv",sep=";") 
my_df = as.data.frame(ten_df[,3])

since I want to look at the third column, var2, and my_df is what I originally wanted to upload, the last command returns a warning:

Warning messages:
1: In rep(no, length.out = length(ans)) :
  'x' is NULL so the result will be NULL

Also, when I afterwards select my_df to analyse it with rattle, rattle says "0 input variable" in the feedback bar at the bottom. How can this be?
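One thing that may be worth checking here (an assumption, since the cause of the warning is not established in this thread): `ten_df[,3]` drops the column to a plain vector, so `as.data.frame()` has to derive a column name on its own. A minimal sketch with a made-up stand-in for ten.csv (the columns `id`, `var1` and `var2` and their values are hypothetical), using `drop = FALSE` to keep a one-column data frame with its original name:

```r
# Hypothetical stand-in for ten.csv; the real file is not reproduced here.
ten_df <- data.frame(id   = 1:4,
                     var1 = c(2.5, NA, 1.0, 3.2),
                     var2 = c(NA, 7, NA, 9))

# Extracting with [, 3] returns a bare vector, so the original column
# name is lost before as.data.frame() ever sees it.
vec_df <- as.data.frame(ten_df[, 3])
names(vec_df)   # inspect the name that was generated

# drop = FALSE keeps the result a data frame with the original name.
one_col <- ten_df[, 3, drop = FALSE]
names(one_col)
sum(is.na(one_col$var2))
```

Whether this explains the "0 input variable" message in rattle is not confirmed; it only shows a way to build the one-column data frame without losing the variable name.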

moodymudskipper
billyboy
  • As for your side question, `is.na` can only return `TRUE/FALSE`, argument `na.rm` is irrelevant. To see this try `x <- c(1:3, NaN, NA, 4, 5, NA); sum(is.na(x))`. As for the difference in reported values of missing values, it's hard to tell without seeing the data. I would trust `summary(X)`. – Rui Barradas Oct 19 '17 at 10:28
  • @RuiBarradas Thanks! – billyboy Oct 19 '17 at 15:47

1 Answer

From ?NA:

NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.

class(NA)             # "logical"
class(NA_integer_)    # "integer"
class(NA_real_)       # "numeric"
class(NA_complex_)    # "complex"
class(NA_character_)  # "character"
is.na(NA)             # TRUE
is.na(NA_integer_)    # TRUE
is.na(NA_real_)       # TRUE
is.na(NA_complex_)    # TRUE
is.na(NA_character_)  # TRUE
identical(NA,NA_integer_)    # FALSE
identical(NA,NA_real_)       # FALSE
identical(NA,NA_complex_)    # FALSE
identical(NA,NA_character_)  # FALSE
identical(NA_character_,as.character(NA)) # TRUE
identical(NA_real_,as.numeric(NA))        # TRUE
identical(as.logical(NA_real_),NA)        # TRUE

So NA is a logical. Why, then, can we use NA pretty much everywhere without worrying about its class? Because of coercion rules:

class(c(NA,1)[1])                # "numeric"
identical(c(NA,1),c(NA_real_,1)) # TRUE
c(NA_character_,1)               # [1] NA  "1"

Depending on the class, NA might also be printed differently (for example, as <NA> in a factor).
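As a small illustration of the printing point, here is a sketch with three toy vectors (the names `x_num`, `x_chr` and `x_fac` are made up for this example). The printed form of the missing value differs by class, but `is.na()` treats them all the same way:

```r
x_num <- c(1, NA)
x_chr <- c("a", NA)
x_fac <- factor(c("a", NA))

print(x_num)  # the missing value is shown as NA
print(x_chr)  # strings are quoted, the missing value is a bare NA
print(x_fac)  # the missing value is shown as <NA>

# Whatever the printed form, is.na() detects each of them.
n_missing <- c(num = sum(is.na(x_num)),
               chr = sum(is.na(x_chr)),
               fac = sum(is.na(x_fac)))
n_missing
```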

Now back to your question. I can't answer the first part because you offer no reproducible data. As for why sum(is.na(my_df[,1]), na.rm = FALSE) and sum(is.na(my_df[,1]), na.rm = TRUE) give the same number: is.na(my_df[,1]) is made up only of TRUE and FALSE values, never NAs, so na.rm has nothing to remove.

To count the non-missing values instead, you can try length(na.omit(my_df[,1])).
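A short sketch of the point above, reusing the example vector from the comment under the question (the variable names `flags`, `n_missing` and `n_present` are only illustrative):

```r
x <- c(1:3, NaN, NA, 4, 5, NA)

flags <- is.na(x)   # logical vector; NaN is flagged as missing too
anyNA(flags)        # FALSE: the flags themselves contain no NA

n_missing <- sum(flags)
# na.rm has nothing to remove here, so the count is unchanged.
stopifnot(n_missing == sum(flags, na.rm = TRUE))

# Three equivalent ways to count the non-missing values.
n_present <- sum(!is.na(x))
stopifnot(n_present == length(na.omit(x)))
stopifnot(n_present == length(x) - n_missing)
```

So the number that was expected from `na.rm = TRUE` in the question is the count of non-missing values, which comes from negating `is.na()` rather than from `na.rm`.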

EDIT:

A given column of a data.frame has only elements of one class, so you won't have different NA_character_ and NA_real_ in the same column.

Something that happens often, however, is that some entries are strings whose value is "NA". You shouldn't, of course, expect is.na to detect those. In these cases you can use df[df == "NA"] <- NA to get regular NAs instead of "NA" strings in your data.frame.
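A minimal sketch of that situation, using a made-up one-column data frame (`df` and its values are hypothetical, not taken from the question's dataset):

```r
# One real missing value and one literal "NA" string.
df <- data.frame(x = c("a", "NA", NA), stringsAsFactors = FALSE)

before <- sum(is.na(df$x))   # only the real NA is counted

# Turn the "NA" strings into regular missing values.
df[df == "NA"] <- NA

after <- sum(is.na(df$x))    # both entries are now counted
c(before = before, after = after)
```

If the strings come from a file, the `na.strings` argument of `read.csv()` can be used to declare which strings should be read in as missing.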

moodymudskipper
  • Thanks, this is a really good, in-depth answer so far, I learned a lot from you! I wish I could upvote, but I can do that only after I have 15 rep points; once I reach that I'll try to upvote retrospectively. I also uploaded the dataset, could you have a look at it, so that the first answer can be answered as well? – billyboy Oct 19 '17 at 15:44
  • Unfortunately I can't access this website from my work computer, but my guess is that `sum(is.na(df[,1]))` will give you the correct answer and that, for some reason, the alternative values you find are either not up to date or just estimates. It would help to see the contradiction through code commands rather than through what you see in the tabs :). You can also cut your dataset down to a reasonable size and explore it manually; it's already not that big. – moodymudskipper Oct 19 '17 at 15:50