Why do NAs appear in some valid (not missing) variables in some subsets of a data frame

Question

This problem has me baffled. I'm not an experienced R user, so what I've done may not be elegant, but it's not complicated and I don't understand the problem.

I begin with a simple data frame that has 6 columns and several hundred rows. The data columns are Year, Month, Day, and three numeric variables. There may be several rows that have the same values of Year, Month, and Day. Here is an example:

> thisFrame
    Year Month Day trans_fac   dist     var
1   2003     3  23   42.3475  1.858 1.48190
2   2003     3  23   42.3475  2.779 1.42260
3   2003     3  23   42.3475  4.145 1.39150
4   2003     3  23   42.3475  5.069 1.37860
5   2003     3  23   42.3475  6.439 1.42050
6   2003     3  23   42.3475  8.736 2.54290
7   2003     3  23   42.3475  9.661 1.29120
8   2003     3  23   42.3475 11.040 1.24360
9   2003     3  23   42.3475 11.960 1.32190
10  2003     3  23   42.3475 13.340 1.34820
11  2003     3  23   42.3475 14.270 1.34630
12  2003     3  23   42.3475 15.640 1.37820
13  2003     3  23   42.3475 16.570 1.39550
[some rows snipped]
24  2003     3  23   42.3475 29.840 1.09530
25  2003     4  11   42.3475  2.091 2.62980
26  2003     4  11   42.3475  3.557 1.61910
27  2003     4  11   42.3475  5.446 1.03760
28  2003     4  11   42.3475  7.099 0.93600
29  2003     4  11   42.3475  8.798 1.02190
30  2003     4  11   42.3475 10.630 1.03940
31  2003     4  11   42.3475 12.240 0.96743
32  2003     4  11   42.3475 14.110 0.95497

Because I want to operate on each day's data independently, I calculate the Julian (Unix) day for each row, add it to the data frame as the variable jdays, and then find the unique days.

days <- as.Date(ISOdate(thisFrame$Year,thisFrame$Month,thisFrame$Day))
thisFrame$jdays  <- as.integer(days)
uniq_days <- unique(thisFrame$jdays)           
nudays    <- length(uniq_days)          # number of unique days

I then loop through the number of unique days and create new data frames by subsetting the original frame based on the unique day. Then I want to print the results of the operation along with the day, month, and year of the input set. Should be simple, right? Well, the results are exactly what I want sometimes and sometimes, I get NA values for the day, month, and year even though they are present in the subset. I've tried this with several different input data frames and haven't found any pattern that would help me understand why this is happening.

for (i in 1:nudays) {
 thisSet <- thisFrame[thisFrame$jdays == uniq_days[i],]
 print(thisSet)
 print(c(i, thisSet$Day[i], thisSet$Month[i], thisSet$Year[i]))
}

Expected result:

[1] "This is subset  1"
   Year Month Day trans_fac   dist     var jdays
1  2003     3  23   44.4335  2.011 1.12240 12134
2  2003     3  23   44.4335  3.180 0.92435 12134
3  2003     3  23   44.4335  4.147 0.95406 12134
[lines snipped]
28 2003     3  23   44.4335 29.870 0.75302 12134
[1]    1    3   23 2003
[1] "This is subset  2"
   Year Month Day trans_fac   dist     var jdays
29 2003     3  26   44.4335  3.514 1.01300 12137
30 2003     3  26   44.4335  5.275 0.74062 12137
31 2003     3  26   44.4335  7.031 0.67548 12137
[lines snipped]
45 2003     3  26   44.4335 31.220 0.58399 12137
[1]    2    3   26 2003

etc. Until we get to

[1] "This is subset  18"
    Year Month Day trans_fac   dist     var jdays
358 2003     8  18   44.4335  2.075 0.85803 12282
359 2003     8  18   44.4335  3.524 0.71728 12282
[lines snipped]
374 2003     8  18   44.4335 30.320 0.76502 12282
[1] 18 NA NA NA

but then we're back to the expected behavior

[1] "This is subset  19"
    Year Month Day trans_fac   dist     var jdays
375 2003     8  19   44.4335  2.475 1.17220 12283
376 2003     8  19   44.4335  3.875 0.87088 12283
[lines snipped]
397 2003     8  19   44.4335 30.070 0.76463 12283
[1]   19    8   19 2003

Until we get to

[1] "This is subset  21"
    Year Month Day trans_fac   dist     var jdays
418 2003     9   2   44.4335  1.781 2.00410 12297
419 2003     9   2   44.4335  3.783 0.96007 12297
420 2003     9   2   44.4335  5.479 0.85195 12297
[lines snipped]
433 2003     9   2   44.4335 28.530 0.89522 12297
[1] 21 NA NA NA

And back to expected result

[1] "This is subset  24"
    Year Month Day trans_fac   dist     var jdays
464 2003     9  17   44.4335  1.173 1.80490 12312
465 2003     9  17   44.4335  2.587 1.04510 12312
[lines snipped]
487 2003     9  17   44.4335 29.770 0.82791 12312
[1]   24    9   17 2003

and so on.

I'm not seeing the problem and would appreciate any advice. Thanks.

It would help if you made a fully [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example); with all the line skipping it's hard to follow along and reproduce your results. But you should probably look into using `split()` or `by()` instead to split your data.frame up in the first place. — MrFlick, Jan 21 '15 at 20:16
Thank you @MrFlick for your note and for answering my question. I wasn't sure about how much to include in the question, nor how to include a full data set. I thought what I did include was sufficient to be understandable - guess I was wrong about that, but you hit right on the problem anyway. — Fleetboat, Jan 21 '15 at 20:47
A smaller (minimal) data set is always preferred, but it should be run-able. Using "lines snipped" makes it difficult to re-create the problem. — MrFlick, Jan 21 '15 at 20:49

score 3 · Answer 1 · answered Jan 21 '15 at 20:21

The problem is

print(c(i, thisSet$Day[i], thisSet$Month[i], thisSet$Year[i]))

Your i values increase with each Julian date, but thisSet contains only the rows for one particular day. So if the 10th day (i == 10) had only three rows (nrow(thisSet) == 3), you'd be asking for the 10th element of thisSet$Day, which doesn't exist, and R returns NA for an out-of-range index. I'm not sure what you want to print there, but replacing the i with 1 should prevent the NA values by always selecting the first row.

print(c(i, thisSet$Day[1], thisSet$Month[1], thisSet$Year[1]))
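
To see the effect in isolation, here is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions for illustration, not the asker's data). The third day has only two rows, so indexing with `i == 3` runs past the end of the subset and yields NA, while indexing with 1 does not.

```r
# Hypothetical toy data: three days with 3, 3, and 2 rows respectively
toy <- data.frame(
  Year  = rep(2003, 8),
  Month = c(3, 3, 3, 3, 3, 3, 4, 4),
  Day   = c(23, 23, 23, 26, 26, 26, 11, 11)
)
toy$jdays <- as.integer(as.Date(ISOdate(toy$Year, toy$Month, toy$Day)))
uniq_days <- unique(toy$jdays)

by_i     <- numeric(0)  # Day taken from row i of each subset (the original code)
by_first <- numeric(0)  # Day taken from row 1 of each subset (the fix)
for (i in seq_along(uniq_days)) {
  thisSet  <- toy[toy$jdays == uniq_days[i], ]
  by_i     <- c(by_i, thisSet$Day[i])      # NA once i exceeds nrow(thisSet)
  by_first <- c(by_first, thisSet$Day[1])  # the first row always exists
}
print(by_i)      # [1] 23 26 NA
print(by_first)  # [1] 23 26 11
```

This also explains why the NAs looked random in the question: they appear only for subsets that happen to have fewer than i rows.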

But this really isn't a great way to split a data.frame. How about using the split() command?

split(thisFrame, thisFrame[,c("Year","Month","Day")], drop=TRUE)
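
As a rough sketch of how the result of that call might be used (again on a made-up `toy` data frame, which is an assumption and not the asker's data), `split()` returns a named list with one data frame per Year/Month/Day combination, which can then be looped over:

```r
# Hypothetical toy data: three days with 3, 3, and 2 rows respectively
toy <- data.frame(
  Year  = rep(2003, 8),
  Month = c(3, 3, 3, 3, 3, 3, 4, 4),
  Day   = c(23, 23, 23, 26, 26, 26, 11, 11),
  var   = c(1.48, 1.42, 1.39, 1.01, 0.74, 0.68, 2.63, 1.62)
)

# One list element per Year/Month/Day combination that actually occurs;
# drop = TRUE discards combinations with no rows
subsets <- split(toy, toy[, c("Year", "Month", "Day")], drop = TRUE)

for (i in seq_along(subsets)) {
  thisSet <- subsets[[i]]
  # Every row in thisSet shares the same date, so row 1 is representative
  print(c(i, thisSet$Month[1], thisSet$Day[1], thisSet$Year[1]))
}
```

Note that the list elements are ordered by factor level, not necessarily chronologically, so sort by date first if the order matters.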

Alternatively, you could use the by() command if you wanted to apply a function to each subset.
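
A minimal sketch of the by() approach, under the assumption that the per-day operation is a simple mean of `var` (the real operation isn't stated in the question) and using a single date string as the grouping key for simplicity; the `toy` data frame is made up for illustration:

```r
# Hypothetical toy data: three days with 3, 3, and 2 rows respectively
toy <- data.frame(
  Year  = rep(2003, 8),
  Month = c(3, 3, 3, 3, 3, 3, 4, 4),
  Day   = c(23, 23, 23, 26, 26, 26, 11, 11),
  var   = c(1.48, 1.42, 1.39, 1.01, 0.74, 0.68, 2.63, 1.62)
)

# Build one key per row, e.g. "2003-03-23"
day_key <- format(as.Date(ISOdate(toy$Year, toy$Month, toy$Day)))

# Apply the function to the rows of each day; d is that day's data frame
res <- by(toy, day_key, function(d) mean(d$var))
print(res)
```

Because the key is an ISO-formatted date string, the groups come out in chronological order.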
