1

I have a list of accounts (300k plus rows), going back six years, with a user number, open and close dates, and other information, such as location. We offer a variety of accounts, and a user can have one or several, in any combination, and both in succession as well as overlapping.

I've been asked to find out how many users we have in any given month. They'd like it split by location, as well as total.

so I have a table like this:

   User    Open       Close      Area 
 1 A       2018-02-13 2018-07-31 West 
 2 B       2018-02-26 2018-06-04 North
 3 B       2018-02-27 2018-03-15 North
 4 C       2018-02-27 2018-05-26 South
 5 C       2018-03-15 2018-06-03 South
 6 D       2018-03-20 2018-07-02 East 
 7 E       2018-04-01 2018-06-19 West 
 8 E       2018-04-14 2018-05-04 West 
 9 F       2018-03-20 2018-04-19 North
10 G       2018-04-26 2018-07-04 South
11 H       2017-29-12 2018-03-21 East
12 I       2016-11-29 2020-04-10 West
13 J       2018-01-31 2018-12-20 West
14 K       2017-10-31 2018-10-30 North
15 K       2018-10-31 2019-10-30 North

And I want to get to one that looks something like this:

      Month  Total North  East South  West
1 Feb 18     3     1     0     1     1
2 Mar 18     5     2     1     1     1
3 Apr 18     7     2     1     2     2
4 May 18     6     1     1     2     2
5 Jun 18     6     1     1     2     2
6 Jul 18     3     0     1     1     1

I can filter the data to get to what I need for individual months using

 df%>%
   filter(Open <= as.Date("2018-04-30") & Close >= as.Date("2018-04-01")) %>%
distinct(PERSON_ID, .keep_all = TRUE) %>%
   count(Area) 

But what I can't figure out is how to repeat that for every month in the data set automatically. Is there any where of getting r to repeat the above for every month in my data set, and then pass the results into a second table?

Any and all help gratefully received, and many thanks for your time.

Edit: added examples to the source data where Matin Gal's solution returned NA for years

3 Answers3

1

Here's how I'd do it:

library(tidyverse)

set.seed(14159)

## generating some data that looks roughly
##  like your data

data <- tibble(
  user = sample(LETTERS[1:5], size = 20, replace = TRUE),
  open = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20),
  close = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20),
  area = sample(c("N", "E", "S", "W"), 20, replace = T)
) %>%
  filter(
    close > open
  )

data
#> # A tibble: 9 × 4
#>   user  open       close      area 
#>   <chr> <date>     <date>     <chr>
#> 1 A     1999-04-03 1999-07-28 N    
#> 2 B     1999-01-27 1999-05-12 W    
#> 3 B     1999-06-05 1999-12-29 W    
#> 4 C     1999-09-26 1999-12-30 W    
#> 5 C     1999-04-21 1999-12-04 E    
#> 6 C     1999-08-11 1999-12-12 N    
#> 7 A     1999-02-13 1999-09-16 W    
#> 8 E     1999-02-17 1999-05-21 E    
#> 9 B     1999-07-26 1999-08-16 S

## figuring out what months are in between open and close
get_months_in_range <- function(open, close) {
  seq.Date(
    open,
    close,
    by = "month"
  ) %>%
    list()
}

data %>%
  rowwise() %>%
  mutate(
    Month = get_months_in_range(open, close)
  ) %>%
  ungroup() %>%
  unnest_longer(
    col = Month
  ) %>%
  count(Month, area) %>%
  pivot_wider(
    names_from = area,
    values_from = n,
    values_fill = 0
  ) %>%
  rowwise() %>%
  mutate(
    Total = sum(
      c_across(
        -Month
      )
    )
  ) %>%
  ungroup()
#> # A tibble: 45 × 6
#>    Month          W     E     N     S Total
#>    <date>     <int> <int> <int> <int> <int>
#>  1 1999-01-27     1     0     0     0     1
#>  2 1999-02-13     1     0     0     0     1
#>  3 1999-02-17     0     1     0     0     1
#>  4 1999-02-27     1     0     0     0     1
#>  5 1999-03-13     1     0     0     0     1
#>  6 1999-03-17     0     1     0     0     1
#>  7 1999-03-27     1     0     0     0     1
#>  8 1999-04-03     0     0     1     0     1
#>  9 1999-04-13     1     0     0     0     1
#> 10 1999-04-17     0     1     0     0     1
#> # … with 35 more rows

Created on 2021-08-18 by the reprex package (v2.0.1)

It's not the world's sexiest solution, but I think it'll get you where you're trying to go. Basically, I just make a helper function that gives me all the dates between open and close and then you can group by those to figure out how many users you have in any given month. Let me know if you want more explanation about what the long chain of dplyr stuff is doing.

Matt Kaye
  • 521
  • 4
  • 5
1

This is a general solution working for dates spanning over more than one year.

library(dplyr)
library(tidyr)
library(lubridate)

data %>%
  group_by(rn = row_number()) %>%
  mutate(seq = list(seq(month(Open), month(Close) + 12 * (year(Close) - year(Open))))) %>% 
  unnest(seq) %>%
  mutate(
    seq_2 = (seq - 1) %% 12 + 1,
    month = month(seq_2, label = TRUE),
    year  = year(Open + months(seq - first(seq)))
    ) %>% 
  ungroup() %>% 
  distinct(User, month, year, Area) %>% 
  count(month, year, Area) %>% 
  pivot_wider(
    names_from = "Area", 
    values_from = "n", 
    values_fill = 0
    ) %>% 
  mutate(Total = rowSums(across(c(North, South, West, East))))

returns

  month  year North South  West  East Total
  <ord> <dbl> <int> <int> <int> <int> <dbl>
1 Feb    2018     1     1     1     0     3
2 Mar    2018     2     1     1     1     5
3 Apr    2018     2     2     2     1     7
4 May    2018     1     2     2     1     6
5 Jun    2018     1     2     2     1     6
6 Jul    2018     0     1     1     1     3

Data

df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), User = c("A", 
"B", "B", "C", "C", "D", "E", "E", "F", "G"), Open = structure(c(17575, 
17588, 17589, 17589, 17605, 17610, 17622, 17635, 17610, 17647
), class = "Date"), Close = structure(c(17743, 17686, 17605, 
17677, 17685, 17714, 17701, 17655, 17640, 17716), class = "Date"), 
    Area = c("West", "North", "North", "South", "South", "East", 
    "West", "West", "North", "South")), problems = structure(list(
    row = 10L, col = "Area", expected = "", actual = "embedded null", 
    file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
    cols = list(id = structure(list(), class = c("collector_double", 
    "collector")), User = structure(list(), class = c("collector_character", 
    "collector")), Open = structure(list(format = ""), class = c("collector_date", 
    "collector")), Close = structure(list(format = ""), class = c("collector_date", 
    "collector")), Area = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))
Martin Gal
  • 16,640
  • 5
  • 21
  • 39
  • Could you include those cases in your sample data? – Martin Gal Aug 23 '21 at 18:14
  • 1
    I've managed to get it sorted by using the advice in [this post](https://stackoverflow.com/questions/22628863/add-subtract-6-months-bond-time-in-r-using-lubridate) . by changing the part of the mutate the creates year to: `year(ceiling_date(as.Date(ELEMENT_START_DATE) %m+% months(seq - first(seq)), unit = "month") - days(1))` I've got it to work. Thanks ever so much. – Beautiful Night Aug 27 '21 at 10:43
0

welcome to SO. I can't test this code as you haven't provided a snippet of your data in the right format (see below for a suggestion on this point), but I think the basic idea of what you want to do is extract a month-year value from Open and then use group_by. For example:

library(lubridate)
library(dplyr)

df %>% mutate(
  Date = dmy(Open),
  Month_Yr = format_ISO8601(Date, precision = "ym")) %>% 
  group_by(Month_Yr) %>% 
  distinct(PERSON.ID, .keep_all = TRUE) %>%
  count(Area) 

Generally when sharing data on SO it's best to use a dput. See ?dput for info on how to use it if you're unsure.

C.Robin
  • 1,085
  • 1
  • 10
  • 23
  • 1
    I think this misses that accounts are active between open and close, but I'm not 100% sure I understand the question right so I may be wrong – Matt Kaye Aug 18 '21 at 22:16
  • It could well do. I wasn't 100% sure either so erred for a more minimalist understanding. Your post is much more detailed and likely closer to (or exactly) what the OP wants – C.Robin Aug 18 '21 at 22:19