3

I'm trying to calculate difference/duration between the first and n rows of a dataframe that match in one column. I want to place that value in a new column "duration". Sample data: below.

y <- data.frame(c("USA", "USA", "USA", "France", "France", "Mexico", "Mexico", "Mexico"), c(1992, 1993, 1994, 1989, 1990, 1999, 2000, 2001))
colnames(y) <- c("Country", "Year")
y$Year <- as.integer(y$Year) # this is to match the class of my actual data

My desired result is:

1992    USA 0
1993    USA 1
1994    USA 2
1989    France  0
1990    France  1
1999    Mexico  0
2000    Mexico  1
2001    Mexico  2

I've tried using dplyr's group_by and mutate

y <- y %>% group_by(Country) %>% mutate(duration = Year - lag(Year)) 

but I can only get the actual lag year (e.g. 1999) or only calculate the difference between sequential rows getting me either NA for the first row of a country or 1 for all other rows with the same country. Many q & a's focus on difference between sequential rows and not between the first and n rows.

Thoughts?

Community
  • 1
  • 1
  • You can do this `with(y, ave(seq_along(Year), Country, FUN = seq_along)-1)` or if it is the difference `y %>% group_by(Country) %>% mutate(duration = Year - first(Year))` – akrun Oct 08 '16 at 14:05
  • Thanks, @akrun! This is great. I'm new-ish to R and spent a few hours trying to figure it out but must have been searching for the wrong terms/processes. Thanks again! – user6780841 Oct 08 '16 at 14:12

1 Answers1

2

This can be done by subtracting the first 'Year' with the 'Year' column after grouping by 'Country'.

y %>%
   group_by(Country) %>%
   mutate(duration = Year - first(Year))
akrun
  • 874,273
  • 37
  • 540
  • 662