1

Consider a data frame of the form

       idnum      start        end var1 var2 var3 var4 
1993.1    17 1993-01-01 1993-12-31  foo  bar    1    2 

with start and end being of type Date

 $ idnum : int  17 17 17 17 27 27
 $ start : Date, format: "1993-01-01" "1993-01-01" "1993-01-01" "1993-01-01" ...
 $ end   : Date, format: "1993-12-31" "1993-12-31" "1993-12-31" "1993-12-31" ...

I would like to create a new dataframe, that has instead monthly observations for every row, for every month in between start and end (including the boundaries):

Desired Output

idnum       month var1 var2 var3 var4 
   17  1993-01-01  foo  bar    1    2
   17  1993-02-01  foo  bar    1    2
...
   17  1993-12-01  foo  bar    1    2

I was suggested

require(data.table) ## 1.9.2+
setDT(df)[, list(idnum=idnum, month=seq(start,end,by="month")), by=1:nrow(df)]

However, I have a long list of additional columns that I also want to move with me (basically all the columns inside df besides start, end. Is there an elegant way of providing these additional columns? My naive approach was to replace idnum=idnum with colnames(df), which did not work.

Update

I tried, as suggested (since I wanted the code to be robust to changes of order in columns, I adjusted it slightly)

columnNames = colnames(df)[colnames(df) != 'start' & colnames(df) != 'end']
require(data.table) 
test <- data.frame(df)
setDT(test)
result <- test[, list( month=seq(start,end,by="month")), by=eval(columnNames) ]

but I got an

Error in seq.Date(start, end, by = "month") : 'from' must be of length 1
Community
  • 1
  • 1
FooBar
  • 15,724
  • 19
  • 82
  • 171
  • The problem here as I see it is that `data.table` doesn't know how exactly to expand your other variables, thus you'll need to set some type of rules – David Arenburg Jul 22 '14 at 14:41
  • The other variables should just be copied among the expansion: `var1=var1, var2=var2,...`. I'm looking for a nice shorthand for that here. – FooBar Jul 22 '14 at 14:44
  • Can you provide some more columns to your example data and the desired output? – David Arenburg Jul 22 '14 at 14:46
  • I added some columns, I hope that removes confusion about my goals. – FooBar Jul 22 '14 at 14:52

1 Answers1

1

Put the columns in by=list(...)

library(data.table)
df <- data.frame(idnum=17, 
                 start=as.Date("1993-01-01"), end=as.Date("1993-12-31"),
                 var1="foo",var2="bar",var3=1,var4=2)
setDT(df)
result <-df[,list(month=seq(start,end,by="month")), by=list(idnum,var1,var2,var3,var4)]
head(result)
#    idnum var1 var2 var3 var4      month
# 1:    17  foo  bar    1    2 1993-01-01
# 2:    17  foo  bar    1    2 1993-02-01
# 3:    17  foo  bar    1    2 1993-03-01
# 4:    17  foo  bar    1    2 1993-04-01
# 5:    17  foo  bar    1    2 1993-05-01
# 6:    17  foo  bar    1    2 1993-06-01

If you have a great many columns and don't want to enumerate them, this will include all columns except 2:3: (produces the same result above with this dataset).

result <-df[,list(month=seq(start,end,by="month")), by=eval(names(df)[-(2:3)])]
jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • I tried running this on my dataset, but I got an error - I updated my question. – FooBar Jul 23 '14 at 12:36
  • Can you upload your dataset somewhere and post a link in your question? Absent that, it looks like you may have duplicate rows; that is, for a given combination of everything but start and end, you have multiple rows. – jlhoward Jul 23 '14 at 16:43
  • Great call, there were duplicate rows. – FooBar Jul 24 '14 at 12:31