0

This is the sample data.

created_date  start_date
2014-12-11    2014-12-10
2014-12-11    2014-12-11
2014-12-12    2014-12-13
2014-12-13    NULL       
2014-12-13    2014-12-13
2014-12-13    2014-12-13
2014-12-23    NULL
2014-12-23    NULL

I'd like to count how many start_date was checked each day, according to the created_date. The value of start_date is not important, only the 'number' of start_dates checked is meaningful.

In this case, the result of for loop should be like this

created_date  count
2014-12-11     2 
2014-12-12     1
2014-12-13     2
2014-12-23     0

I cannot simply use table() because:

table(created_date) will count created_date, not start_date.

>table(created_date)

created_date  count
2014-12-11     2 
2014-12-12     1
2014-12-13     3
2014-12-23     2

table(start_date) won't work either, since it doesn't count the created date of "NULL" and more importantly, the value of start_date itself is meaningless.

>table(start_date)

created_date  count
2014-12-10     1 
2014-12-11     1
2014-12-13     3
NULL           3

I guess for loop should be used, but don't have idea how to code that. Thanks in advance!

  • I added why this question is different from the existing one. table() doesn't work for this code – Woobin Yun Aug 04 '16 at 07:11
  • The way the question is currently written, you *can* use `table` and it will do exactly what you want — you just need to remove rows with `NULL` beforehand! – Konrad Rudolph Aug 04 '16 at 08:59
  • @KonradRudolph Thanks for your comment! But if I remove rows with NULL and use table, created_date Count 2014-12-23 0 this will not appear. What I want to do is counting the number of start_date with value for each created_date. – Woobin Yun Aug 04 '16 at 09:17
  • Fair point. How about creating two tables: one for all data, and one for the `NULL` data only (= removing all nonnull rows), and then subtract the second from the first? – Konrad Rudolph Aug 04 '16 at 09:18
  • What do you mean by subtracting the second from the first? How is it different from removing rows with NULL as you recommended at first? – Woobin Yun Aug 04 '16 at 09:21
  • It’s not the same thing. Unfortunately it’s too complicated to explain in a comment, and since this question is still closed I couldn’t write an answer. So I’ve uploaded an explanation with the code on Github instead: https://gist.github.com/klmr/b7fd96c113e24c178d2630b66bb73a15 – Konrad Rudolph Aug 04 '16 at 10:47
  • @akrun Mind reopening the question? It’s really not a duplicate … – Konrad Rudolph Aug 04 '16 at 10:50

1 Answers1

1

Short version: Use table separately on the complete data and on the empty rows only, subtract second from first.

Long version:

Assuming your data is in x (and that NULLs are actually NAs, see Gist for details):

Count the entries, and put them into data_frames for convenience:

library(dplyr)
all_counts = as_data_frame(table(x$created_date))
na_counts = as_data_frame(table(x[is.na(x$start_date), ]$created_date))

Subtract the na_counts from the full_counts. To do this, we first need to join these two tables. Joining will introduce NAs, which we will replace by 0s:

full_join(all_counts, na_counts, by = 'Var1') %>%
    mutate(n.y = ifelse(is.na(n.y), 0, n.y)) %>%
    mutate(count = n.x - n.y) %>% # And finally, subtract the counts.
    select(created_date = Var1, count)

Result:

| created_date   |   count |
|:---------------|--------:|
| 2014-12-11     |       2 |
| 2014-12-12     |       1 |
| 2014-12-13     |       2 |
| 2014-12-23     |       0 |
Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214