2

I'm looking for a faster way to calculate a group mean with multiple grouping variables while excluding own group values. A thought experiment would be finding average value (e.g. price) for a county from the counties in the same state in the same year excluding own county's value. Here's a toy data set.

df <- data_frame(
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = sample.int(100, 12)
)

df %>%
  group_by(state, county, year) %>%
  summarise(q = mean(df$value[df$state == state & df$county != county & df$year == year]))

# Groups:   state, county [6]
   state county  year     q
   <chr> <chr>  <int> <dbl>
 1 AL    a       2011  56  
 2 AL    a       2012  46  
 3 AL    b       2011  50.5
 4 AL    b       2012  52  
 5 AL    c       2011  55.5
 6 AL    c       2012  29  
 7 CA    d       2011  52.5
 8 CA    d       2012  32  
 9 CA    e       2011  68.5
10 CA    e       2012  31.5
11 CA    f       2011  32  
12 CA    f       2012  42.5

The above code gives me a desired result, but when I apply this to a larger dataset (with more grouping variables) it gets really slow. Do you have any suggestion on how to speed this up?

If the original approach is incorrect, please point that out as well.

M--
  • 25,431
  • 8
  • 61
  • 93
qnp1521
  • 806
  • 6
  • 20
  • 1
    The `summarise` call looks weird. There shouldn't be a need to use the full `df` and for all the `$`-indexing. I'm not entirely clear on what you'd like to summarise. Can you include your expected output? – Maurits Evers Jan 16 '20 at 22:48
  • I did so building on https://stackoverflow.com/questions/36450278/summarize-with-dplyr-other-then-groups this post. Is it a wrong approach? – qnp1521 Jan 16 '20 at 22:51
  • Not sure (yet); can you include the expected output for the sample you give? You group by `state`, `county` and `year` but as as far as I can see there are only unique entries per `state`, `county` and `year` so there's nothing to summarise. – Maurits Evers Jan 16 '20 at 22:52
  • That's an odd use of `summarise`; you're actually not summarising anything, with the dimension of your expected output being the same as your input. It seems a `mutate` call would be more fitting here. – Maurits Evers Jan 16 '20 at 22:56
  • Aha! Yea it's not really "summarizing" in that sense, but is more like "replacing" the original value with means from other counties within the same [state] and [year]. Yea, we can do mutate as well. But a harder question for me is what should go inside the mutate (or summarise). – qnp1521 Jan 16 '20 at 22:58
  • 1
    Not a bad idea to make your examples that are random with setting a seed (e.g. `set.seed(123)`) so the results are reproducible. Cheers. – M-- Jan 16 '20 at 23:16

2 Answers2

2

A more efficient approach would be to sum the 'value's after grouping by 'state', 'year', subtract from the 'value', then divide the n() - 1 observations

library(dplyr)   
library(purrr) 
out2 <- df %>%
          group_split(state, year) %>%
          map_dfr(~ .x %>% 
                 mutate(q = (sum(value) - value)/(n()-1))) %>%
          select(-value) %>%
          arrange(state, county)

-checking with the OP's output ('out1')

all.equal(out2, out1, check.attributes = FALSE)
#[1] TRUE
akrun
  • 874,273
  • 37
  • 540
  • 662
2
library(dplyr)

df %>%
  group_by(state, year) %>%
  mutate(q = (sum(value) - value) / (n()-1))

#> # A tibble: 12 x 5
#> # Groups:   state, year [4]
#>    state county  year value     q
#>    <chr> <chr>  <int> <int> <dbl>
#>  1 AL    a       2011    68  30.5
#>  2 AL    a       2012    63  42  
#>  3 AL    b       2011    53  38  
#>  4 AL    b       2012    56  45.5
#>  5 AL    c       2011     8  60.5
#>  6 AL    c       2012    28  59.5
#>  7 CA    d       2011     7  40  
#>  8 CA    d       2012    69  41  
#>  9 CA    e       2011    39  24  
#> 10 CA    e       2012    79  36  
#> 11 CA    f       2011    41  23  
#> 12 CA    f       2012     3  74

Data:

#data_frame is deprecate!
df <- tibble(
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = sample.int(100, 12)
)
M--
  • 25,431
  • 8
  • 61
  • 93