
I am trying to do something very similar to "Scale relative to a value in each group (via dplyr)" (however, that solution seems to crash R for me). I would like to take a single value from each group and add a new column with that value repeated across the group. As an example, I have

library(dplyr)

data = expand.grid(
  category = LETTERS[1:2],
  year = 2000:2003)
data$value = runif(nrow(data))

data

  category year     value
1        A 2000 0.6278798
2        B 2000 0.6112281
3        A 2001 0.2170495
4        B 2001 0.6454874
5        A 2002 0.9234604
6        B 2002 0.9311204
7        A 2003 0.5387899
8        B 2003 0.5573527

And I would like a dataframe like

data

  category year     value    value2
1        A 2000 0.6278798 0.6278798
2        B 2000 0.6112281 0.6112281
3        A 2001 0.2170495 0.6278798
4        B 2001 0.6454874 0.6112281
5        A 2002 0.9234604 0.6278798
6        B 2002 0.9311204 0.6112281
7        A 2003 0.5387899 0.6278798
8        B 2003 0.5573527 0.6112281

i.e. value2 for each category is that category's value from year 2000. I was trying to think of a general solution that extends to any given filtering criterion, something like

data %>% group_by(category) %>% mutate(value = filter(data, year==2002))

However, this does not work because the assignment has the wrong length.

mgilbert

1 Answer


Do this:

data %>% group_by(category) %>%
  mutate(value2 = value[year == 2000])

You could also do it this way:

data %>% group_by(category) %>%
  arrange(year) %>%
  mutate(value2 = value[1])

or

data %>% group_by(category) %>%
  arrange(year) %>%
  mutate(value2 = first(value))

or

data %>% group_by(category) %>%
  mutate(value2 = nth(value, n = 1, order_by = year))

or probably several other ways.
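
One caveat with the first form: `value[year == 2000]` assumes every group has exactly one row for year 2000. Below is a minimal sketch of a more defensive variant, assuming you would rather get `NA` for a group that lacks the reference year. The toy data and the name `ref_year` are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical): category B has no row for year 2000
data <- data.frame(
  category = c("A", "A", "B", "B"),
  year     = c(2000, 2001, 2001, 2002),
  value    = c(0.1, 0.2, 0.3, 0.4)
)

ref_year <- 2000  # hypothetical reference year

result <- data %>%
  group_by(category) %>%
  # match() gives the position of the first matching row, or NA if none;
  # indexing with NA yields NA rather than a zero-length vector
  mutate(value2 = value[match(ref_year, year)]) %>%
  ungroup()
```

If a group has several rows for the reference year, `match()` silently picks the first one, which may or may not be what you want.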

Your attempt with mutate(value = filter(data, year==2002)) doesn't work for two reasons.

  1. When you explicitly pass in data again, it's not part of the chain that got grouped earlier, so it doesn't know about the grouping.

  2. All dplyr verbs take a data frame as first argument and return a data frame, including filter. When you do value = filter(...) you're trying to assign a full data frame to the single column value.
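
If you do want `filter()` to do the lookup, one option is to build the per-category reference values as their own data frame and join them back. This is a sketch of that alternative rather than the approach shown above; the name `ref` is made up for illustration.

```r
library(dplyr)

data <- expand.grid(category = LETTERS[1:2], year = 2000:2003)
data$value <- runif(nrow(data))

# One row per category holding the reference value (hypothetical name `ref`)
ref <- data %>%
  filter(year == 2000) %>%
  select(category, value2 = value)

# Attach the reference value to every row of the matching category
result <- left_join(data, ref, by = "category")
```

Because `filter()` returns a data frame here and the join handles the matching, any filtering criterion that leaves one row per category should work the same way.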

Gregor Thomas
  • ahh okay, yes I knew there was something fishy about passing in data into filter() again but could not think of a way to do this otherwise. In your first example am I correct in assuming under the hood something of the form data[data$year==2002,] is happening and then since this is within the context of a group it is aware how to broadcast these values? – mgilbert Dec 03 '15 at 21:05
  • When the data is grouped, think of each group as its own data frame, so it starts with `sub_df = data[data$category == "A", ]`. `dplyr` knows the column names, so in `value[year == 2000]` it looks up `year` and `value` inside `sub_df`. `year == 2000` returns a logical vector that is TRUE for the rows where year is 2000, and that vector is used to subset the corresponding `value` column. The single resulting value is then recycled to every row of the group. – Gregor Thomas Dec 03 '15 at 21:13
  • data.table does this more explicitly, referring to each sub-data-frame as `.SD` (short for **S**ubset of **D**ata). – Gregor Thomas Dec 03 '15 at 21:14
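
The per-group picture described in the comments above can be sketched in base R. This only illustrates the mental model (split, evaluate inside each piece, recombine) and is not how dplyr is implemented internally; `groups` and `sub_df` are illustrative names.

```r
data <- expand.grid(category = LETTERS[1:2], year = 2000:2003)
data$value <- runif(nrow(data))

# Split into one data frame per category
groups <- split(data, data$category)

# Evaluate the expression inside each sub data frame; the length-1 result
# is recycled to every row of that group
groups <- lapply(groups, function(sub_df) {
  sub_df$value2 <- with(sub_df, value[year == 2000])
  sub_df
})

# Reassemble the pieces in the original row order
result <- unsplit(groups, data$category)
```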