NA when trying to summarize a subset of data (R)

Question

The whole vector is OK and has no NAs:

> summary(data$marks)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    6.00    6.00    6.02    7.00    7.00

> length(data$marks)
[1] 2528

However, when I try to summarize a subset selected by a criterion, I get lots of NAs:

> summary(data[data$student=="John",]$marks)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   6.000   6.000   6.169   7.000   7.000     464

> length(data[data$student=="John",]$marks)
[1] 523

Please provide a reproducible example. Additionally, instead of `data[data$student=="John",]$marks` I would recommend `data[data$student=="John", "marks"]`; it is more traditional as well as easier to read. – Jacob H, Dec 03 '15 at 00:17
Are there missing values for `student`? If any values of `student` are missing, then you would get `NA`s, even if there are no missing values for `student=="John"` and no missing values for `marks`. What happens if you do `summary(data[which(data$student=="John"), ]$marks)`? – eipi10, Dec 03 '15 at 00:21
Yes, there are actually missing values for `student`. However, what is the logic behind it if I specified a particular value using an exact match? What is the solution then? – Genrikh Lukianchuk, Dec 03 '15 at 00:22

eipi10 · Accepted Answer · 2015-12-03T00:35:47.260

1

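I think the problem is that you have missing values for student. A comparison such as student == "John" returns NA (not FALSE) wherever student is missing, and each NA in a logical row index returns a row filled with NA. As a result, marks shows up as NA for those rows even though no marks are actually missing. Wrap the subsetting condition in which() to avoid this problem: it returns only the positions where the condition is TRUE. Here are a few examples that will hopefully clarify what's happening: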

# Fake data
set.seed(103)
dat = data.frame(group=rep(LETTERS[1:3], each=3), 
                 value=rnorm(9))
dat$group[1] = NA

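dat$value                             # no NAs in value
dat[dat$group=="B", "value"]          # the NA in group produces an extra NA here
dat[which(dat$group=="B"), "value"]   # only the three "B" values

# Simpler example
x = c(10,20,30,40, NA)

x>20      # FALSE FALSE TRUE TRUE NA
x[x>20]   # 30 40 NA

which(x>20)      # 3 4
x[which(x>20)]   # 30 40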
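
Applied to a data frame shaped like the one in the question, the same idea can be sketched as follows. The data frame `df` and its values are made up for illustration; only the column names `student` and `marks` come from the question. Besides `which()`, `%in%` and `subset()` are two other base R options that should treat a missing student as "no match".

```r
# Hypothetical data: two student names are missing, no marks are missing
df <- data.frame(
  student = c("John", NA, "Mary", "John", NA),
  marks   = c(6, 7, 5, 4, 3),
  stringsAsFactors = FALSE
)

# The comparison is NA wherever student is NA ...
is_john <- df$student == "John"

# ... and each NA in a logical row index yields an NA in the result
naive <- df[is_john, "marks"]

# which() keeps only the positions where the condition is TRUE
with_which <- df[which(is_john), "marks"]

# %in% returns FALSE (never NA) for a missing student
with_in <- df[df$student %in% "John", "marks"]

# subset() treats NA in the condition as FALSE
with_subset <- subset(df, student == "John")$marks

summary(with_which)
```

None of these variants modifies `df`, so the rows with a missing student stay in the dataset; they are only left out of the summary.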

edited Dec 03 '15 at 00:35

answered Dec 03 '15 at 00:28

eipi10

Can you add a solution then, please? I need to keep the `NA`s in my dataset (so removing those rows won't work in my case) but calculate the summary without them. – Genrikh Lukianchuk Dec 03 '15 at 00:31
Wrap your subset in `which()` as in my example. `which` returns the indices of the rows that match the condition, ignoring NA values. – eipi10 Dec 03 '15 at 00:33

score 0 · Answer 2 · answered Dec 03 '15 at 00:33

First, note that NA=="foo" results in NA. When a vector is subset with an index that contains NA, the corresponding element of the result is NA.

t = c(1,2,3)
t[c(1,NA)]
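
The example above indexes with a numeric NA; in the question the index is logical, and the same rule applies. A small sketch with made-up vectors (`v` and `keep` are illustrative names, not from the question):

```r
# Comparing NA with anything gives NA, not FALSE
cmp <- NA == "foo"

v <- c(1, 2, 3)
keep <- c(TRUE, NA, FALSE)

# A logical NA in the index produces an NA element in the result
picked <- v[keep]

# Turning the NAs in the index into FALSE first avoids that
picked_clean <- v[keep & !is.na(keep)]
```

Here `picked` has two elements, the second of which is NA, while `picked_clean` contains only the first element of `v`.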

answered Dec 03 '15 at 00:33

bluefish

score 0 · Answer 3 · answered Jun 25 '18 at 13:30

A tidyverse solution. I find these easier to read than base R.

library(tidyverse)

# filter() drops rows where the condition is NA, so which() is not needed
data %>%
  filter(student == "John") %>%
  pull(marks) %>%
  summary()

answered Jun 25 '18 at 13:30

Ben G
