47

Consider a data frame of the form

       idnum      start        end
1993.1    17 1993-01-01 1993-12-31
1993.2    17 1993-01-01 1993-12-31
1993.3    17 1993-01-01 1993-12-31

with start and end being of type Date

 $ idnum : int  17 17 17 17 27 27
 $ start : Date, format: "1993-01-01" "1993-01-01" "1993-01-01" "1993-01-01" ...
 $ end   : Date, format: "1993-12-31" "1993-12-31" "1993-12-31" "1993-12-31" ...

I would like to create a new dataframe, that has instead monthly observations for every row, for every month in between start and end (including the boundaries):

Desired Output

idnum       month
   17  1993-01-01
   17  1993-02-01
   17  1993-03-01
...
   17  1993-11-01
   17  1993-12-01

I'm not sure what format month should have, I will at some point want to group by idnum, month for regressions on the rest of the data set.

So far, for every single row, seq(from=test[1,'start'], to=test[1, 'end'], by='1 month') gives me the right sequence - but as soon as I try to apply that to the whole data frame, it will not work:

> foo <- apply(test, 1, function(x) seq(x['start'], to=x['end'], by='1 month'))
Error in to - from : non-numeric argument to binary operator
zx8754
  • 52,746
  • 12
  • 114
  • 209
FooBar
  • 15,724
  • 19
  • 82
  • 171
  • As a beginner in `R`, how am I supposed to judge the answers? Is there a way to check them for efficiency, as `%timeit` in Python? – FooBar Jul 17 '14 at 13:51
  • Related: [Expand ranges defined by "from" and "to" columns](https://stackoverflow.com/questions/11494511/expand-ranges-defined-by-from-and-to-columns) – Henrik Jan 23 '23 at 22:10

7 Answers7

45

Using data.table:

require(data.table) ## 1.9.2+
setDT(df)[ , list(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]

# you may use dot notation as a shorthand alias of list in j:
setDT(df)[ , .(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]

setDT converts df to a data.table. Then for each row, by = 1:nrow(df), we create idnum and month as required.

Henrik
  • 65,555
  • 14
  • 143
  • 159
Arun
  • 116,683
  • 26
  • 284
  • 387
  • 3
    The most efficient answer as far as I can tell. A short follow-up: say I'd have actually a long list of columns that I want in the new dataframe, not just `idnum`. Is there an elegant way of providing these? Replacing `idnum=idnum` with `colnames(df)` surely won't work. – FooBar Jul 18 '14 at 13:06
  • On a smallish dataset of about 40k records, this is 25x faster than the dplyr::rowwise() option. – Jacob Nov 04 '16 at 16:18
  • 4
    How to use multiple columns in place of idnum ? – jeganathan velu Apr 26 '19 at 10:01
  • @jeganathanvelu better to ask as a separate question. – Arun May 06 '19 at 00:04
26

Using dplyr :

test %>%
    group_by(idnum) %>%
    summarize(start=min(start),end=max(end)) %>%
    do(data.frame(idnum=.$idnum, month=seq(.$start,.$end,by="1 month")))

Note that here I don't generate a sequence between start and end for each row, instead it is a sequence between min(start) and max(end) for each idnum. If you want the former :

test %>%
    rowwise() %>%
    do(data.frame(idnum=.$idnum, month=seq(.$start,.$end,by="1 month")))
juba
  • 47,631
  • 14
  • 113
  • 118
11

Updated2

With new versions of purrr (0.3.0) and dplyr (0.8.0), this can be done with map2

library(dplyr)
library(purrr)
 test %>%
     # sequence of monthly dates for each corresponding start, end elements
     transmute(idnum, month = map2(start, end, seq, by = "1 month")) %>%
     # unnest the list column
     unnest %>% 
     # remove any duplicate rows
     distinct

Updated

Based on @Ananda Mahto's comments

 res1 <- melt(setNames(lapply(1:nrow(test), function(x) seq(test[x, "start"],
 test[x, "end"], by = "1 month")), test$idnum))

Also,

  res2 <- setNames(do.call(`rbind`,
          with(test, 
          Map(`expand.grid`,idnum,
          Map(`seq`, start, end, by='1 month')))), c("idnum", "month"))


  head(res1)
 #  idnum      month
 #1    17 1993-01-01
 #2    17 1993-02-01
 #3    17 1993-03-01
 #4    17 1993-04-01
 #5    17 1993-05-01
 #6    17 1993-06-01
akrun
  • 874,273
  • 37
  • 540
  • 662
  • +1. I had done `melt(setNames(lapply(1:nrow(test), function(x) seq(test[x, "start"], test[x, "end"], by = "1 month")), test$idnum))` to avoid calling `data.frame` unnecessarily. – A5C1D2H2I1M1N2O1R2T1 Jul 17 '14 at 13:49
  • If all these methods work with my R version, how do I chose one? I am a complete beginner here... are some of these methods better generalizable to similar solutions, or newer and less likely to be deprecated? Is there a performance routine I could use to check them? – FooBar Jul 17 '14 at 13:53
  • @Ananda Mahto. Thanks I replaced my code with yours. – akrun Jul 17 '14 at 16:39
  • @FooBar, part personal preference, part "what code will I be able to understand 6 months from now?", part "how big is my data?" There are a lot of different reasons to pick one approach over the other. The "microbenchmark" package helps you figure out which approaches are most efficient in terms of computing time. – A5C1D2H2I1M1N2O1R2T1 Jul 17 '14 at 16:41
  • @FooBar, For me, if the datasets are considerably big, in general, `dplyr` or `data.table` based solutions would be faster. It is difficult to predict which one to be deprecated. – akrun Jul 17 '14 at 17:05
8

One option creating a sequence per every row using dplyr and tidyr could be:

df %>%
 rowwise() %>%
 transmute(idnum,
           date = list(seq(start, end, by = "month"))) %>%
 unnest(date)

  idnum date      
   <int> <date>    
 1    17 1993-01-01
 2    17 1993-02-01
 3    17 1993-03-01
 4    17 1993-04-01
 5    17 1993-05-01
 6    17 1993-06-01
 7    17 1993-07-01
 8    17 1993-08-01
 9    17 1993-09-01
10    17 1993-10-01
# … with 26 more rows

Or creating the sequence using a grouping ID:

df %>%
 group_by(idnum) %>%
 transmute(date = list(seq(min(start), max(end), by = "month"))) %>%
 unnest(date)

Or when the goal is to create only one unique sequence per ID:

df %>%
 group_by(idnum) %>%
 summarise(start = min(start),
           end = max(end)) %>%
 transmute(date = list(seq(min(start), max(end), by = "month"))) %>%
 unnest(date)

   date      
   <date>    
 1 1993-01-01
 2 1993-02-01
 3 1993-03-01
 4 1993-04-01
 5 1993-05-01
 6 1993-06-01
 7 1993-07-01
 8 1993-08-01
 9 1993-09-01
10 1993-10-01
11 1993-11-01
12 1993-12-01
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
4

tidyverse answer

Data

df <- structure(list(idnum = c(17L, 17L, 17L), start = structure(c(8401, 
8401, 8401), class = "Date"), end = structure(c(8765, 8765, 8765
), class = "Date")), class = "data.frame", .Names = c("idnum", 
"start", "end"), row.names = c(NA, -3L))

Answer and output

library(tidyverse)
df %>%
  nest(start, end) %>%
  mutate(data = map(data, ~seq(unique(.x$start), unique(.x$end), 1))) %>%
  unnest(data)

# # A tibble: 365 x 2
   # idnum       data
   # <int>     <date>
 # 1    17 1993-01-01
 # 2    17 1993-01-02
 # 3    17 1993-01-03
 # 4    17 1993-01-04
 # 5    17 1993-01-05
 # 6    17 1993-01-06
 # 7    17 1993-01-07
 # 8    17 1993-01-08
 # 9    17 1993-01-09
# 10    17 1993-01-10
# # ... with 355 more rows
CPak
  • 13,260
  • 3
  • 30
  • 48
  • dplyr ver `0.7.4` gives `Error: Each column must either be a list of vectors or a list of data frames [data]` – Hedgehog Jun 25 '18 at 23:31
4

And yet another tidyverse approach would be to use tidyr::expand:

library(dplyr, warn = FALSE)
library(tidyr)

df |> 
  mutate(
    row = row_number()
  ) |> 
  group_by(row) |> 
  expand(idnum, date = seq(start, end, "month")) |> 
  ungroup() |> 
  select(-row)
#> # A tibble: 36 × 2
#>    idnum date      
#>    <int> <date>    
#>  1    17 1993-01-01
#>  2    17 1993-02-01
#>  3    17 1993-03-01
#>  4    17 1993-04-01
#>  5    17 1993-05-01
#>  6    17 1993-06-01
#>  7    17 1993-07-01
#>  8    17 1993-08-01
#>  9    17 1993-09-01
#> 10    17 1993-10-01
#> # … with 26 more rows
stefan
  • 90,330
  • 6
  • 25
  • 51
0

A vectorised solution which utilises lubridate for the month calculations.

time_seq_v() is a vectorised version of seq() specifically for date and datetime calculations.

library(lubridate)
library(data.table)
# remotes::install_github("NicChr/timeplyr")
library(timeplyr)
df <- data.frame(idnum = c(1993.1, 1993.2, 1993.3),
                 start = ymd(rep(19930101, 3)),
                 end = ymd(rep(19931231, 3)))
setDT(df)
df[, list(month = time_seq_v(start, end, by = "month"))]
#>          month
#>  1: 1993-01-01
#>  2: 1993-02-01
#>  3: 1993-03-01
#>  4: 1993-04-01
#>  5: 1993-05-01
#>  6: 1993-06-01
#>  7: 1993-07-01
#>  8: 1993-08-01
#>  9: 1993-09-01
#> 10: 1993-10-01
#> 11: 1993-11-01
#> 12: 1993-12-01
#> 13: 1993-01-01
#> 14: 1993-02-01
#> 15: 1993-03-01
#> 16: 1993-04-01
#> 17: 1993-05-01
#> 18: 1993-06-01
#> 19: 1993-07-01
#> 20: 1993-08-01
#> 21: 1993-09-01
#> 22: 1993-10-01
#> 23: 1993-11-01
#> 24: 1993-12-01
#> 25: 1993-01-01
#> 26: 1993-02-01
#> 27: 1993-03-01
#> 28: 1993-04-01
#> 29: 1993-05-01
#> 30: 1993-06-01
#> 31: 1993-07-01
#> 32: 1993-08-01
#> 33: 1993-09-01
#> 34: 1993-10-01
#> 35: 1993-11-01
#> 36: 1993-12-01
#>          month

Created on 2023-05-16 with reprex v2.0.2

NicChr
  • 858
  • 1
  • 9