I am using the Titanic Data from Kaggle. I am trying to find the number of missing values in each column using a simple function.

I was able to find the number of missing values for each column using the code below:

length(which(is.na(titanic_data$PassengerId)))
length(which(is.na(titanic_data$Survived)))
length(which(is.na(titanic_data$Pclass)))
length(which(is.na(titanic_data$Name)))
length(which(is.na(titanic_data$Sex)))
length(which(is.na(titanic_data$Age)))
length(which(is.na(titanic_data$SibSp)))
length(which(is.na(titanic_data$Parch)))
length(which(is.na(titanic_data$Ticket)))
length(which(is.na(titanic_data$Fare)))
length(which(is.na(titanic_data$Cabin)))
length(which(is.na(titanic_data$Embarked)))

I did not want to repeat the code for each column, so I wrote the following function:

missing_val<- function(x,y){
  len <-length(which(is.na(x$y)))
  len
}

#create a list of all column names
cols<- colnames(titanic_data)
cols

#call the function
missing_val(titanic_data,cols)

I keep getting a single zero when I run the missing_val function, even though I know for a fact that there are missing values in the Cabin and Embarked columns.

What I am trying to get is something like 0,0,0,0,0,0,0,0,687,2, indicating that there are 687 missing values in the Cabin column and 2 in the Embarked column.

What am I doing wrong here? Any hint would be appreciated. Thanks.

  • More ways to do this: https://sebastiansauer.github.io/sum-isna/ – Jon Spring Oct 14 '18 at 23:27
  • Thanks for a great resource! – R_and_Python_noob Oct 14 '18 at 23:36
  • @Uwe posted this function to find unique and NA values as a comment on another question: ```totaluniquevals <- function(df) data.frame(Row.Name = names(df), TotalUnique = sapply(df, function(x) length(unique(x))), IsNA = sapply(df, function(x) sum(is.na(x))))```. It was posted in [Unique values in each of the columns of a data frame](https://stackoverflow.com/questions/19761899/unique-values-in-each-of-the-columns-of-a-data-frame) – Russ Thomas Oct 15 '18 at 01:35

2 Answers

If I'm not mistaken, sapply is not vectorized. You can use colSums and is.na directly:

colSums(is.na(titanic_train))
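
To see what this does on a small case, here is a minimal sketch using a made-up data frame (`toy` and its values are hypothetical, not the Kaggle data). `is.na()` applied to a data frame returns a logical matrix, and `colSums()` adds up the `TRUE` entries in each column.

```r
# Hypothetical toy data standing in for the Titanic table
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Fare  = c(7.25, 71.28, 7.92, 53.10),
  Cabin = c(NA, "C85", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

The result is a named vector with one count per column, which is the shape of output the question asks for.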
rafaelc

You can do this with sapply:

library(titanic)
data(titanic_train)
sapply(titanic_train, function(x) sum(is.na(x)))
PassengerId    Survived      Pclass        Name         Sex         Age 
          0           0           0           0           0         177 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           0           0           0 
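
As for why the function in the question returns a single zero: `x$y` does not evaluate `y`. It looks for a column literally named `y`, gets `NULL`, and `length(which(is.na(NULL)))` is 0. The zeros for Cabin and Embarked in the output above also suggest that this copy of the data stores blank entries in those columns as empty strings rather than `NA`. Assuming that is the case, a minimal sketch of a repaired `missing_val` might look like the following; the toy data frame is made up, and the `count_blank` argument is a hypothetical addition, not part of the original function.

```r
# Hypothetical toy data; "" stands in for a blank cell read from a CSV
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "", "Q", "S"),
  stringsAsFactors = FALSE
)

# x[[col]] evaluates `col`, whereas x$col would look for a
# column literally named "col" and return NULL
missing_val <- function(x, cols = colnames(x), count_blank = FALSE) {
  vapply(cols, function(col) {
    v <- x[[col]]
    n <- sum(is.na(v))
    # Optionally treat empty strings as missing (an assumption about the data)
    if (count_blank && is.character(v)) {
      n <- n + sum(!is.na(v) & v == "")
    }
    n
  }, integer(1))
}

na_only    <- missing_val(toy)
with_blank <- missing_val(toy, count_blank = TRUE)
print(na_only)
print(with_blank)
```

If the CSV is read directly, passing `na.strings = c("", "NA")` to `read.csv()` is another way to have blank cells treated as `NA` from the start.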
G5W