1

I am fairly new to programming and very new to R.

Problem

When using dplyr's mutate function with nested ifelse statements, I am getting NAs as a result for some records, and correct sums for others. I do not know why I am getting NAs.

Expectation

When running the following code my expectation is that the following would occur:

  • active_m0 would be created with all records that are still "active" (defined as those records without a cancel_date or with a cancel_date equal or prior to the join_date month and year.
  • active_m1 would be created with all records that are still "active" the following month as compared to the join_date.

--

tom35 <- mutate(tom35, active_m0 = sum(
         ifelse(is.na(cancel_date), 
                 1, 
                 ifelse(year(cancel_date) == year(join_date) & 
                              month(cancel_date) == month(join_date),
                         0, 
                         ifelse(year(cancel_date) == year(join_date) & 
                                     month(cancel_date) < month(join_date), 
                                           0, 
                                           1)))))

tom35 <- mutate(tom35, active_m1 = sum(
           ifelse(is.na(cancel_date), 
                 1, 
                 ifelse(year(cancel_date) == year(join_date) & 
                              month(cancel_date) == month(join_date + months(1)),
                         0, 
                         ifelse(year(cancel_date) == year(join_date) & 
                                      month(cancel_date) < month(join_date + months(1)), 
                                         0, 
                                         1)))))

Actuality

  • active_m0 part seems to be working fine.
  • However, active_m1 is working for exactly 33000 of the 56248 records in the dataset. The rest are being populated with NAs rather than sum total.

My Data

I subset the data into the data that has the correct sum for active_m1:

head(select(tom35_clean, join_date, cancel_date, cohort, total_joins, active_m0, active_m1)) Source: local data frame [6 x 6] Groups: cohort [5]

   join_date cancel_date  cohort total_joins active_m0 active_m1
      <date>      <date>  <fctr>       <int>     <dbl>     <dbl>
1 2015-04-23  2015-05-27  2015-4        6493      5953      3806
2 2014-09-22  2014-11-27  2014-9        4093      3614      2275
3 2015-07-07  2015-07-31  2015-7        1923      1719       671
4 2014-11-14  2015-12-09 2014-11        3808      3515      2165
5 2015-04-23  2016-06-04  2015-4        6493      5953      3806
6 2015-02-16  2015-02-25  2015-2        9051      8411      5480


> str(select(tom35_clean, join_date, cancel_date, cohort, total_joins, active_m0, active_m1))
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 33010 obs. of  6 variables:
 $ join_date  : Date, format: "2015-04-23" "2014-09-22" "2015-07-07" "2014-11-14" ...
 $ cancel_date: Date, format: "2015-05-27" "2014-11-27" "2015-07-31" "2015-12-09" ...
 $ cohort     : Factor w/ 31 levels "2014-1","2014-10",..: 19 12 22 3 19 17 17 12 19 17 ...
 $ total_joins: int  6493 4093 1923 3808 6493 9051 9051 4093 6493 9051 ...
 $ active_m0  : num  5953 3614 1719 3515 5953 ...
 $ active_m1  : num  3806 2275 671 2165 3806 ...
 - attr(*, "vars")=List of 1
  ..$ : symbol cohort
 - attr(*, "drop")= logi TRUE
 - attr(*, "indices")=List of 24
  ..$ : int  3 19 20 21 26 36 40 48 49 57 ...
  ..$ : int  10 25 42 86 96 128 156 174 186 209 ...
  ..$ : int  17 169 858 1571 2072 2074 2079 2399 2797 2914 ...
  ..$ : int  233 234 411 434 560 828 954 3008 3384 4179 ...
  ..$ : int  827 1561 2403 2510 4354 4371 4381 4430 4431
  ..$ : int  13 85 124 277 281 301 377 439 533 568 ...
  ..$ : int  11 29 31 50 51 66 67 84 109 125 ...
  ..$ : int  1 7 24 27 33 47 54 71 73 92 ...
  ..$ : int  201 1605 1819 32857
  ..$ : int  1206 1898 3436 4440 8267
  ..$ : int  56 101 264 685 748 1280 3587 30604
  ..$ : int  5 6 9 12 16 23 30 32 35 38 ...
  ..$ : int  0 4 8 14 15 18 37 41 44 52 ...
  ..$ : int  34 61 114 119 142 160 161 195 212 213 ...
  ..$ : int  2 22 59 60 64 80 89 91 141 148 ...
  ..$ : int  162 435 31958 33005
  ..$ : int 11278
  ..$ : int  130 131 1196 14167 33006
  ..$ : int  46 138 331 420 498 2802 2966 3026 9610 33007
  ..$ : int  28 145 2751 16513 17922
  ..$ : int  1398 3140 25718 25719
  ..$ : int  33008 33009
  ..$ : int  865 16497
  ..$ : int 172
 - attr(*, "group_sizes")= int  3808 2477 96 17 9 1549 1829 4093 4 5 ...
 - attr(*, "biggest_group_size")= int 9051
 - attr(*, "labels")='data.frame':  24 obs. of  1 variable:
  ..$ cohort: Factor w/ 31 levels "2014-1","2014-10",..: 3 4 5 6 7 9 10 12 14 15 ...
  ..- attr(*, "vars")=List of 1
  .. ..$ : symbol cohort
  ..- attr(*, "drop")= logi TRUE

And the data that has the NAs populated rather than sums for active_m1:

> head(select(tom35_nas, join_date, cancel_date, cohort, total_joins, active_m0, active_m1))
Source: local data frame [6 x 6]
Groups: cohort [5]

   join_date cancel_date cohort total_joins active_m0 active_m1
      <date>      <date> <fctr>       <int>     <dbl>     <dbl>
1 2015-03-04  2015-04-17 2015-3        4354      4005        NA
2 2015-05-06  2015-07-02 2015-5        4494      4048        NA
3 2014-05-29        <NA> 2014-5        1325      1201        NA
4 2015-03-05  2015-05-10 2015-3        4354      4005        NA
5 2014-08-19  2014-09-26 2014-8        4502      4069        NA
6 2014-01-02        <NA> 2014-1        1622      1406        NA


> str(select(tom35_nas, join_date, cancel_date, cohort, total_joins, active_m0, active_m1))
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 23248 obs. of  6 variables:
 $ join_date  : Date, format: "2015-03-04" "2015-05-06" "2014-05-29" "2015-03-05" ...
 $ cancel_date: Date, format: "2015-04-17" "2015-07-02" NA "2015-05-10" ...
 $ cohort     : Factor w/ 31 levels "2014-1","2014-10",..: 18 20 8 18 11 1 18 13 2 1 ...
 $ total_joins: int  4354 4494 1325 4354 4502 1622 4354 3615 3336 1622 ...
 $ active_m0  : num  4005 4048 1201 4005 4069 ...
 $ active_m1  : num  NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "vars")=List of 1
  ..$ : symbol cohort
 - attr(*, "drop")= logi TRUE
 - attr(*, "indices")=List of 7
  ..$ : int  5 9 12 17 24 51 57 89 95 96 ...
  ..$ : int  8 13 20 28 30 33 37 59 60 61 ...
  ..$ : int  2 27 29 35 49 82 83 92 110 111 ...
  ..$ : int  4 21 23 25 38 46 50 52 67 74 ...
  ..$ : int  7 18 26 39 40 41 53 54 64 69 ...
  ..$ : int  0 3 6 11 16 22 36 45 47 48 ...
  ..$ : int  1 10 14 15 19 31 32 34 42 43 ...
 - attr(*, "group_sizes")= int  1622 3336 1325 4502 3615 4354 4494
 - attr(*, "biggest_group_size")= int 4502
 - attr(*, "labels")='data.frame':  7 obs. of  1 variable:
  ..$ cohort: Factor w/ 31 levels "2014-1","2014-10",..: 1 2 8 11 13 18 20
  ..- attr(*, "vars")=List of 1
  .. ..$ : symbol cohort
  ..- attr(*, "drop")= logi TRUE

Both lubridate and dplyr are being used.

I cannot figure out why this is not working as it should for active_m1. Any help is appreciated.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
sharly
  • 237
  • 2
  • 7
  • 1
    Please use `dput` per this guide for a reproducible problem http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Tung Jul 14 '16 at 22:56
  • Is your version of dplyr up to date (currently at 0.5.0)? Does the issue persist if you use `na.rm = TRUE` in `sum`? – aosmith Jul 15 '16 at 15:04

0 Answers0