
I have a data.frame with several hundred variables, in which missing values are denoted by NA. There are 571 observations in total. I'm only interested in 20 of the variables, so I want to define a complete observation as any observation that has data in all 20 variables of interest.

One workaround is to run a linear regression on those variables, which drops any observation that has a missing value and reports something like:

(196 observations deleted due to missingness)

This will allow me to infer that my sample size is equal to 571 minus 196. But there must be a better way to do it. Any ideas?

Thank you in advance!

goose144

2 Answers


If you simply want to remove every observation that contains NA in any variable, use na.omit(). If you only care about some of the variables, select them first with subset().

Example:

# some data
df <- data.frame(
  a = c(1,2,3,4,5,NA),
  b = c(NA,2,3,4,5,6),
  c = c(NA,NA,3,4,5,6)
)

# omit rows with NAs
na.omit(df)
#>   a b c
#> 3 3 3 3
#> 4 4 4 4
#> 5 5 5 5

# use only "a" and "b" variables
na.omit(subset(df, select = c("a", "b")))
#>   a b
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5

Created on 2020-07-13 by the reprex package (v0.3.0)

You can count the number of observations with nrow():

nrow(na.omit(df))
#> [1] 3
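
With 20 variables of interest it may be easier to keep the names in a character vector and index with it. A minimal sketch along the same lines, where `vars`, `kept`, `n_complete` and `n_dropped` are placeholder names of mine and not anything from the question:

```r
# same toy data as above
df <- data.frame(
  a = c(1,2,3,4,5,NA),
  b = c(NA,2,3,4,5,6),
  c = c(NA,NA,3,4,5,6)
)

# hypothetical vector holding the names of the variables of interest
vars <- c("a", "b")

# keep only those columns, then drop rows with any NA in them
kept <- na.omit(df[, vars, drop = FALSE])

# number of complete observations and number dropped
n_complete <- nrow(kept)
n_dropped  <- nrow(df) - n_complete

# na.omit() also records which rows it removed
dropped_rows <- attr(kept, "na.action")
```

Here `n_dropped` plays the role of the "observations deleted due to missingness" count from the regression output, without having to fit a model.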
  • Brilliant, thank you! For anyone else, my final code reads `nrow(na.omit(subset(finaldata, select = c("child_age96", "log3Tblood"))))` – goose144 Jul 13 '20 at 19:03
  • @goose144 Great, I'm glad it works! And thanks so much for your feedback for other people! –  Jul 13 '20 at 19:53

Use complete.cases(), which returns TRUE for every row that has no missing values:

df <- data.frame(
  a = c(1,NA,2,NA,3),
  b = c(NA,5,3,5,6),
  c = c(NA,NA,3,5,NA)
)

df[complete.cases(df),]
nrow(df[complete.cases(df),])

Output

  a b c
3 2 3 3
[1] 1
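
To restrict the check to a subset of variables, as the question asks, the same function can be applied to just those columns. A rough sketch, assuming the names of interest are stored in a character vector (`vars` is a made-up name for illustration):

```r
# same toy data as above
df <- data.frame(
  a = c(1,NA,2,NA,3),
  b = c(NA,5,3,5,6),
  c = c(NA,NA,3,5,NA)
)

# hypothetical vector of the variables of interest
vars <- c("a", "b")

# logical vector: TRUE where a row has no NA in the chosen columns
ok <- complete.cases(df[vars])

# count complete observations without building the filtered data frame
n_complete <- sum(ok)

# number of NAs per chosen variable, to see which ones cost the most rows
na_per_var <- colSums(is.na(df[vars]))
```

Summing the logical vector avoids the intermediate subset, and the per-variable counts can help when deciding whether one variable is responsible for most of the lost observations.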

slava-kohut