2

The answers to my last question helped me understand the dplyr n(). But I still couldn't apply to the problem I am trying to solve. So, here I am posting the data and actual problem. The data looks as follows

  finalCalledPartyNumber Call Day           Call Date             Block Answered Call Duration Opty ID Opty Created Date BC ID
             19183598259   Monday 2016-12-05 12:26:46 Block 2(12:30 pm)             1        5    1234              <NA>  <NA>
             17135796188   Monday 2016-12-05 12:18:24 Block 2(12:30 pm)             1       16    <NA>              <NA>  3456
             19183598259   Monday 2016-12-05 15:43:33  Block 4(5:30 pm)             1       10    <NA>              <NA>  4567
             17135796188  Tuesday 2016-12-06 11:53:50 Block 2(12:30 pm)             1       85    <NA>              <NA>  5566
             19014741000  Tuesday 2016-12-06 12:00:11 Block 2(12:30 pm)             1       42    6789              <NA>  <NA>
             19014741000  Tuesday 2016-12-06 10:43:20 Block 2(12:30 pm)             1       58    2345              <NA>  <NA>
  BC Created Date
             <NA>
             <NA>
             <NA>
             <NA>
             <NA>
             <NA>

For each finalCalledPartyNumber, I want to count the number of times each finallCalledPartyNumber was called, number of Opty ID and BC ID. The dataset is huge and <NA> just means no Opty ID or BC ID was created.

The Output:

finalCalledPartyNumber    n_opty    n_BC    n_calls
           19183598259         1       1    2
           17135796188         0       2    2
           19014741000         2       0    2

I appreciate your help. Thank you.

Krishnang K Dalal
  • 2,322
  • 9
  • 34
  • 55
  • If you don't have `NULL`s that you want to exclude, `count` doesn't really need an argument : you can replace `count([finalCalledPartyNumber])` by `count(1)`. So `n()` should be just fine here, and if you have nulls, just filter them out. – HubertL Jun 12 '17 at 21:14
  • [Please reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – M-- Jun 12 '17 at 21:17

2 Answers2

5

n() counts the number of rows in each group. if you don't want to count duplicates of particular columns, you can use n_distinct() and pass in the name(s) of columns. You can also use count() as a shorthand for group_by() + summarize(count = n()), and tally() as a shorthand for the summarize part.

Here's a demonstration of the different verbs and functions:


library(dplyr)

mtcars %>% 
  count(cyl)
#> # A tibble: 3 x 2
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14

mtcars %>%
  group_by(cyl) %>%
  summarize(n = n())
#> # A tibble: 3 x 2
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14


mtcars %>%
  group_by(cyl) %>%
  tally()
#> # A tibble: 3 x 2
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14

mtcars %>%
  group_by(cyl) %>%
  summarize(n_distinct = n_distinct(mpg))
#> # A tibble: 3 x 2
#>     cyl n_distinct
#>   <dbl>      <int>
#> 1     4          9
#> 2     6          6
#> 3     8         12

mtcars %>%
  group_by(cyl) %>%
  distinct(mpg) %>%
  tally()
#> # A tibble: 3 x 2
#>     cyl     n
#>   <dbl> <int>
#> 1     4     9
#> 2     6     6
#> 3     8    12
yeedle
  • 4,918
  • 1
  • 22
  • 22
4

This is another way of achieving the result you were looking for using dplyr, but it doesn't use n()

 df %>%  group_by(finalCalledPartyNumber) %>% 
          summarise(n_opty = sum(!is.na(Opty.Id)),
                    n_BC = sum(!is.na(BC.ID)),
                    n_calls = sum(!is.na(Call)))

Output:

 # A tibble: 3 x 4
   finalCalledPartyNumber  n_opty  n_BC n_calls
                     <dbl>  <int> <int>   <int>
  1            17135796188      0     2       2
  2            19014741000      2     0       2
  3            19183598259      1     1       2
Megha John
  • 153
  • 1
  • 12