12

I have a set of time series, and I want to scale each of them relative to their value in a specific interval. That way, each series will be at 1.0 at that time and change proportionally.

I can't figure out how to do that with dplyr.

Here's a working example using a for loop:

library(dplyr)

data = expand.grid(
  category = LETTERS[1:3],
  year = 2000:2005)
data$value = runif(nrow(data))

# the first time point in the series
baseYear = 2002

# for each category, divide all the values by the category's value in the base year
for(category in as.character(levels(factor(data$category)))) {
  data[data$category == category,]$value = data[data$category == category,]$value / data[data$category == category & data$year == baseYear,]$value[[1]]
}

Edit: Modified the question such that the base time point is not indexable. Sometimes the "time" column is actually a factor, which isn't necessarily ordinal.

sharoz
  • 6,157
  • 7
  • 31
  • 57

3 Answers3

13

This solution is very similar to @thelatemail, but I think it's sufficiently different enough to merit its own answer because it chooses the index based on a condition:

data %>%
    group_by(category) %>%
    mutate(value = value/value[year == baseYear])

#   category  year      value
#...     ...   ...       ...
#7         A  2002 1.00000000
#8         B  2002 1.00000000
#9         C  2002 1.00000000
#10        A  2003 0.86462789
#11        B  2003 1.07217943
#12        C  2003 0.82209897

(Data output has been truncated. To replicate these results, set.seed(123) when creating data.)

divibisan
  • 11,659
  • 11
  • 40
  • 58
oshun
  • 2,319
  • 18
  • 32
10

Use first in dplyr, ensuring you use order_by

data %>% 
  group_by(category) %>% 
  mutate(value = value / first(value, order_by = year))
Hugh
  • 15,521
  • 12
  • 57
  • 100
  • Thanks! What if I want to scale by the median year, or a scenario where indexing wouldn't work? – sharoz Nov 25 '14 at 02:24
  • `mutate(valute = value/median(value))` use `group_by` to select the subset the median should be taken over. – Hugh Nov 25 '14 at 02:35
  • That's the media value. I'm asking what if it scales by the value of the median year (or a specific year). – sharoz Nov 25 '14 at 03:01
4

Something like this:

data %>% 
  group_by(category) %>% 
  mutate(value=value/value[1]) %>%
  arrange(category,year)

Result:

#   category year     value
#1         A 2000 1.0000000
#2         A 2001 0.2882984
#3         A 2002 1.5224308
#4         A 2003 0.8369343
#5         A 2004 2.0868684
#6         A 2005 0.2196814
#7         B 2000 1.0000000
#8         B 2001 0.5952027
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Another way would be to use `first` and `order_by`. The minimum year should be in the first row for each category by using `order_by`. `mutate(group_by(data, category), out = order_by(year, value / first(value))) %>% arrange(category)` – jazzurro Nov 25 '14 at 02:00
  • Thanks! What if I want to scale by the median year, where indexing wouldn't work? – sharoz Nov 25 '14 at 02:02
  • @sharoz Do you mean `mutate(value=value/median(year))` – David Robinson Nov 25 '14 at 11:34
  • @DavidRobinson I meant the value at the median year – sharoz Nov 25 '14 at 12:16
  • 1
    @sharoz: Ah, understood. You can use the `approx` function, which interpolates (linearly) one vector based on another. Try `data %>% group_by(category) %>% arrange(category, year) %>% mutate(value = value / approx(year, value, median(year))$y)` – David Robinson Nov 25 '14 at 19:26
  • I'm not sure what that did, but it's not what I was after. All the values for 2002 should be 1. – sharoz Nov 25 '14 at 22:46
  • @sharoz - `mutate(value=value/value[year == round(median(year))])` maybe? – thelatemail Nov 26 '14 at 00:46
  • @sharoz: 2002 is *not* the median year of 2000-2005. The median is 2002.5. If you filtered out 2005 first you would indeed get all values for 2002 being 1. – David Robinson Nov 26 '14 at 01:31