
The whole vector is OK and has no NAs:

> summary(data$marks)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    6.00    6.00    6.02    7.00    7.00

> length(data$marks)
[1] 2528

However, when I take a subset using a criterion, I get lots of NAs:

> summary(data[data$student=="John",]$marks)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   6.000   6.000   6.169   7.000   7.000     464

> length(data[data$student=="John",]$marks)
[1] 523
  • Please provide a reproducible example. Additionally, instead of `data[data$student=="John",]$marks` I would recommend, `data[data$student=="John", "marks"]`, it is more traditional as well as easier to read – Jacob H Dec 03 '15 at 00:17
  • Are there missing values for `student`? If any values of `student` are missing, even if there are no missing values for `student=="John"` and no missing values for `marks`, then you would get `NA`s. What happens if you do `summary(data[which(data$student=="John"), ]$marks)`? – eipi10 Dec 03 '15 at 00:21
  • Yes, there are actually missing values for `student`. However, what is the logic behind it if I specified a particular student using an exact match? What is a solution then? – Genrikh Lukianchuk Dec 03 '15 at 00:22
  • NA gets pulled out by `==` - `x <- c(1,2,NA); x[x==1]` – jeremycg Dec 03 '15 at 00:27

3 Answers


I think the problem is that you have missing values for student. When you subset by student, every row where student is NA makes the comparison return NA, and each NA in the index produces an NA row, so marks shows up as NA for those rows. Wrap the subsetting condition in which() to avoid this problem. Here are a few examples that will hopefully clarify what's happening:

# Fake data
set.seed(103)
dat = data.frame(group=rep(LETTERS[1:3], each=3), 
                 value=rnorm(9))
dat$group[1] = NA

dat$value
dat[dat$group=="B", "value"]
dat[which(dat$group=="B"), "value"]

# Simpler example
x = c(10,20,30,40, NA)

x>20
x[x>20]

which(x>20)
x[which(x>20)]
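
For comparison, here is a minimal sketch of a few base R ways to keep the NA comparisons out of the subset. The `scores` data frame and its values are made up for illustration; `which()`, `%in%`, and `subset()` all treat an NA comparison as "no match".

```r
# Hypothetical data: two student names are missing, no marks are missing
scores <- data.frame(
  student = c("John", NA, "Mary", "John", NA),
  marks   = c(7, 6, 5, 6, 7),
  stringsAsFactors = FALSE
)

# Plain logical indexing: each NA in the condition yields an NA element
with_na <- scores[scores$student == "John", "marks"]

# Three ways to drop the NA comparisons instead
by_which  <- scores[which(scores$student == "John"), "marks"]
by_in     <- scores[scores$student %in% "John", "marks"]
by_subset <- subset(scores, student == "John")$marks

summary(with_na)   # reports NA's even though marks has none
summary(by_which)  # summary over John's marks only
```

The original data frame is left untouched in every case; only the index used for the subset changes.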
eipi10
  • Can you add a solution then please? So I need to leave `NAs` in my dataset (so subsetting won't work in my case), but to calculate summary without them. – Genrikh Lukianchuk Dec 03 '15 at 00:31
  • Wrap your subset in `which()` as in my example. `which` returns the indices of the rows that match the condition, ignoring NA values. – eipi10 Dec 03 '15 at 00:33

First, note that NA=="foo" results in NA. When a vector is subset with an index that contains NA, the corresponding element of the result is NA.

t = c(1,2,3)
t[c(1,NA)]
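
The same two facts can be checked step by step. This is a small sketch with made-up values (`v`, `cmp`, and `picked` are illustrative names); it uses a logical index, which is what a comparison like `==` actually produces.

```r
v <- c(1, 2, 3)

# A comparison against NA is NA, not FALSE
cmp <- NA == "foo"
is.na(cmp)

# A logical index with an NA in it: TRUE keeps the element,
# FALSE drops it, and NA contributes an NA to the result
picked <- v[c(TRUE, NA, FALSE)]
picked
```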
bluefish

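A tidyverse solution. I find these easier to read than base R.

library(tidyverse)

# filter() keeps only rows where the condition is TRUE,
# so rows with NA in student are dropped
data %>%
  filter(student == "John") %>%
  pull(marks) %>%   # extract the marks column as a vector
  summary()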
Ben G