I'm using the rattle package to do some data cleaning and I'm considering the first variable X in my dataset. In the first tab, the "Data" tab, I get some basic information about the dataset, and it says that variable X has 1243 missing values. This is also the value that I get if I use sum(is.na(my_df[,1])).
On the next tab, the "Explore" tab, when I check "Summary" it now says that I have just 942 NAs in variable X.
How can I make sense of these different numbers? I manually browsed through my dataset and looked at some rows that had NAs, and those NAs all look the same (I understand that sometimes there are different types of NAs).
(Side question: sum(is.na(my_df[,1]), na.rm = FALSE) and sum(is.na(my_df[,1]), na.rm = TRUE) also both produce the same number, 1243. Why? I would have expected one of them to give me length(my_df[,1]) - 1243.)
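A toy version of the side question, using a made-up vector x rather than my real column (the values are an assumption, purely for illustration). It shows both calls giving the same count, and that the logical vector returned by is.na() contains no NA itself, which may be why na.rm has nothing to remove:

```r
# Toy vector standing in for my_df[, 1]; the values are invented
x <- c(1, NA, 3, NA, 5)

# is.na() returns a logical vector of TRUE/FALSE with no NA in it
na_flags <- is.na(x)

sum(na_flags, na.rm = FALSE)   # 2
sum(na_flags, na.rm = TRUE)    # 2, same result because na_flags has no NA to drop
anyNA(na_flags)                # FALSE

# The number of non-missing values has to be computed explicitly
length(x) - sum(na_flags)      # 3
sum(!is.na(x))                 # 3
```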
EDIT: Here is the dataset that has this problem: https://wetransfer.com/downloads/cf454b2c12857a4e3770102a7222422f20171019153755/516fb0 .
The numbers in it are slightly different: instead of 1243, we have 88 NAs according to the "Data" tab in rattle() (or, equivalently, according to summary(ten_df)), and 62 NAs according to the "Explore" tab with "Summary" checked.
But now I suspect my dataset is broken. Before uploading the complete one, I originally wanted to upload only one illustrative column. But when I execute
ten_df = read.csv("ten.csv",sep=";")
my_df = as.data.frame(ten_df[,3])
(I want to look at the third column, var2, and my_df was what I originally wanted to upload), the last command gives a warning rather than an error:
Warning messages:
1: In rep(no, length.out = length(ans)) :
'x' is NULL so the result will be NULL
Also, when I afterwards select my_df to analyse it with rattle, rattle says "0 input variable" in the feedback bar at the bottom. How can this be?
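In case it matters for the "0 input variable" message, here is a small sketch with a made-up data frame toy (hypothetical, not my real ten_df) comparing my way of extracting the column with the drop = FALSE form. I don't know whether this is what rattle trips over, but the two results do not carry the same column name:

```r
# Hypothetical stand-in for ten_df; the values are invented
toy <- data.frame(
  id   = 1:4,
  var1 = c("a", "b", NA, "d"),
  var2 = c(10, NA, 30, NA)
)

# toy[, 3] is a bare vector, so the original column name var2 is lost
a <- as.data.frame(toy[, 3])

# drop = FALSE keeps a one-column data frame that is still named var2
b <- toy[, 3, drop = FALSE]

names(a)             # a name derived from the expression, not "var2"
names(b)             # "var2"
sum(is.na(b$var2))   # 2
```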