1

I have a data set that looks like the following that I'd like to expand to a monthly panel data set.

ID | start_date | end_date | event_type |
 1 |   01/01/97 | 08/01/98 |          1 |
 2 |   02/01/97 | 10/01/97 |          1 |
 3 |   01/01/96 | 12/01/04 |          2 |

Some cases last longer than others. I've figured out how to expand the data to a yearly configuration by pulling out the year from each date and then using:

year <- ddply(df, c("ID"), summarize, year = seq(startyear, endyear))

followed by:

month <- ddply(year, c("ID"), summarize, month = seq(1, 12))

The problem with this approach is that it doesn't assign the correct number for the month, i.e. January = 1, and so it doesn't play well with an event data set that I would like to eventually merge it with, where I would be matching on year, ID, and month. Help would be appreciated. Here is a direct link to the data set I am trying to expand (.xls): http://db.tt/KeLRCzr9. Hopefully I've included enough information, but please let me know if there is any other information needed.

Zach
  • 996
  • 12
  • 25

2 Answers2

1

You could try something more like this:

ddply(df,.(ID),transform,dt = seq.Date(as.Date(start_date,"%m/%d/%Y"),as.Date(end_date,"%m/%d/%Y"),by = "month"))

There will probably be a lot of warnings having to do with the row names, and I can't guarantee that this will work, since the data set you link to does not match the example you provide. For starters, I'm assuming that you cleaned up the start and end dates, since they appear in various formats in the .xls file.

joran
  • 169,992
  • 32
  • 429
  • 468
  • I don't understand what is different, other than the names ofc. The columns I am actually using are `ConflEp`, `EpStartDate`, and `EpEndDate`. Yes I did clean up the dates, which are now all in a `%Y-%m-%d` format. Here is a link to the cleaned version, sorry I didn't think of that. http://db.tt/KeLRCzr9 I got this error after running that: Error in seq.int(r1$mon, 12 * (to0$year - r1$year) + to0$mon, by) : 'to' must be finite – Zach Apr 09 '12 at 00:27
  • @Zach Did you adjust the date format? I wrote that code based on the example you provided, with the dates in %m/%d/%Y format. – joran Apr 09 '12 at 00:43
  • Yes I did. I converted them to "%Y-%m-%d" and then ran this: test <- ddply(data1,.(ConflEp), transform, dt = seq.Date(EpStartDate, EpEndDate, by = "month")). I got the "to" must be finite error and a warning that said "row names were found from a short variable and have been discarded" – Zach Apr 09 '12 at 00:48
  • @Zach Well I can't help any further until you provide a reproducible example. The data set you link to has no ID column. The ID column in your example is unique, but all the potentially equivalent columns in the xls appear to not be unique. Reconcile these differences, and then I might look at it again. – joran Apr 09 '12 at 00:50
  • The column `ConflEp` is unique, which is the ID column I am using. There is no repetition of that in the data set I linked to. One row per ID. – Zach Apr 09 '12 at 00:52
  • In that case the code I gave you works, with the exception that the data you link to has not had its dates cleaned up completely. Many are missing or a year range like 1976-1984. I'm done with this, sorry. – joran Apr 09 '12 at 00:54
0
ddply(df, .(ID), summarize, dt = seq.Date(start_date, end_date, by = "month"))

Assuming start_date and end_date are date objects already. Joran got me close though, so again, thanks for the help on that.

Zach
  • 996
  • 12
  • 25