0

Case:
Given are a dataframe df with (among others) a startDate and an endDate column. My goal is to "transform" df to a new dataframe df_res that contains one row for every month occuring in the interval (startDate, endDate), with additional columns year, month and numberOfDaysInMonth, all of type int.

Example:
Input: df

  id    startDate     endDate  someOtherCol
   1   2017-09-23  2018-02-01          val1
   2   2018-01-01  2018-03-31          val2
 ...          ...         ...           ...

Desired output: df_res

  id  year  month  numberOfDaysInMonth  someOtherCol
   1  2017      9                    8          val1
   1  2017     10                   31          val1
   1  2017     11                   30          val1
   1  2017     12                   31          val1
   1  2018      1                   31          val1
   1  2018      2                    1          val1
   2  2018      1                   31          val2
   2  2018      2                   28          val2
   2  2018      3                   31          val2
 ...   ...    ...                  ...           ... 

Background:
I am relatively new to R but aware of the great dplyrand lubridate packages. I just did not manage to achieve the above in a neat way, even when using those packages. The closest I got was this: Expand rows by date range using start and end date, but that does not yield the number of days per month contained in the range.

Any help is greatly appreciated.

KvA
  • 3
  • 3
  • Welcome to SO! Please read [ask]! Do not use images to present your data, read https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example ! – jogo Aug 15 '18 at 07:32
  • `seq.Date` can go by `month`, see `?seq.Date`. – MichaelChirico Aug 15 '18 at 07:36
  • Besides that, see https://stackoverflow.com/questions/6243088/find-out-the-number-of-days-of-a-month-in-r – MichaelChirico Aug 15 '18 at 07:38
  • Hello @jogo, thank you for your response. I have reformatted the dataframes. Is it okay like this? – KvA Aug 15 '18 at 07:39
  • Thank you @MichaelChirico. I checked both. I understand how these operate individualy, but do not get how to apply them to get from `df` to `df_res`. Also, the latter finds the number of days in a given month, but I am looking for the number of days in a month that are contained in the interval `(startDate, endDate)` (see example). Do you know how to do just that? – KvA Aug 15 '18 at 07:55

2 Answers2

2

If you do not mind a data.table solution, you can create a sequence of consecutive dates between startDate and endDate before you aggregate by id, someOtherCol, year and month as follows:

dat[, .(Dates=seq(startDate, endDate, by="1 day")), by=.(id, someOtherCol)][,
    .N, by=.(id, someOtherCol, year(Dates), month(Dates))]

output:

   id someOtherCol year month  N
1:  1         val1 2017     9  8
2:  1         val1 2017    10 31
3:  1         val1 2017    11 30
4:  1         val1 2017    12 31
5:  1         val1 2018     1 31
6:  1         val1 2018     2  1
7:  2         val2 2018     1 31
8:  2         val2 2018     2 28
9:  2         val2 2018     3 31

data:

library(data.table)    
dat <- fread("id    startDate     endDate  someOtherCol
1   2017-09-23  2018-02-01          val1
2   2018-01-01  2018-03-31          val2")
datecols <- c("startDate", "endDate")
dat[, (datecols) := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=datecols]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • Thanks for your suggestion @chinsoon12. Although this works fine, I prefer the `tidyverse` solution as it is easier - at least to me - to follow its logic. The computation time constraint in my application will probably not be binding, so I did not compare this method with @AntoniosK 's proposal on computational performance. – KvA Aug 16 '18 at 14:58
2

A tidyverse solution:

# example data
df = read.table(text = "
id    startDate     endDate  someOtherCol
1   2017-09-23  2018-02-01          val1
2   2018-01-01  2018-03-31          val2
", header=T, stringsAsFactors=F)

library(tidyverse)
library(lubridate)


df %>%
  mutate_at(vars(startDate, endDate), ymd) %>%                  # update to date columns (if needed)
  group_by(id) %>%                                              # for each id
  mutate(d = list(seq(startDate, endDate, by="1 day"))) %>%     # create a sequence of dates (as a list)
  unnest() %>%                                                  # unnest data
  group_by(id, year=year(d), month=month(d), someOtherCol) %>%  # group by those variables (while getting year and month of each date in the sequence)
  summarise(numberOfDaysInMonth = n()) %>%                      # count days
  ungroup()                                                     # forget the grouping

# # A tibble: 9 x 5
#      id  year month someOtherCol numberOfDaysInMonth
#   <int> <dbl> <dbl> <chr>                      <int>
# 1     1  2017     9 val1                           8
# 2     1  2017    10 val1                          31
# 3     1  2017    11 val1                          30
# 4     1  2017    12 val1                          31
# 5     1  2018     1 val1                          31
# 6     1  2018     2 val1                           1
# 7     2  2018     1 val2                          31
# 8     2  2018     2 val2                          28
# 9     2  2018     3 val2                          31
AntoniosK
  • 15,991
  • 2
  • 19
  • 32