I have a data frame with a list of cities and daily temperature recordings

data = data.frame(c("Chicago", "Chicago", "New York", "New York", "Denver"),
                  c(25, 36, 23, 24, 42))

I want to add a third column that holds the average temperature for each city:

avgtemp = c(30.5, 30.5, 23.5, 23.5, 42)

I tried to do this with the dplyr package but did not succeed. What is the best way to achieve this? The full dataset contains 50,000 rows, so I want the code to be efficient.

David Arenburg
user3725021
  • What is the meaning of "but have not had success"? Please show attempts, error message and so on. –  Dec 08 '15 at 11:27
  • Try `ave(data[, 2], data[, 1])` if the values are actually numeric. @CathG fixed to match their desired output. – David Arenburg Dec 08 '15 at 11:32
  • If the data are large and performance is an issue, you could try data.table – Heroka Dec 08 '15 at 11:35
  • If you want to do it with `dplyr`, you could use `group_by` in combination with `mutate` – Jaap Dec 08 '15 at 11:52
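
For reference, the `ave` suggestion from the comments can be sketched in base R as follows. The column names `city` and `temp` are assumptions added for readability (the data frame in the question has no explicit names), and `avgtemp` is the name the question uses for the desired result.

```r
# Data from the question; the column names city and temp are assumed here
data <- data.frame(city = c("Chicago", "Chicago", "New York", "New York", "Denver"),
                   temp = c(25, 36, 23, 24, 42))

# ave() returns a vector as long as temp in which every value is replaced
# by the mean of its group (FUN defaults to mean)
data$avgtemp <- ave(data$temp, data$city)
```

Because `ave` returns one value per row, the result can be assigned directly as a new column without a separate merge step.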

1 Answer

I think what you are looking for (if you want to use dplyr) is a combination of the functions `group_by` and `mutate`.

library(dplyr)
city <- c("a", "a", "b", "b", "c")
temp <- 1:5
df <- data.frame(city, temp)

df %>% group_by(city) %>% mutate(mean(temp))

Which would output:

    city  temp mean(temp)
  (fctr) (int)      (dbl)
1      a     1        1.5
2      a     2        1.5
3      b     3        3.5
4      b     4        3.5
5      c     5        5.0
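
The pipeline above only prints the result and leaves the new column with the default name `mean(temp)`. A minimal variant, assuming you want the column stored in the data frame under the name `avgtemp` (taken from the question), might look like this; the trailing `ungroup()` is an optional addition so that later operations are not silently grouped by city.

```r
library(dplyr)

# Same toy data as in the answer
city <- c("a", "a", "b", "b", "c")
temp <- 1:5
df <- data.frame(city, temp)

# Name the new column explicitly and assign the result back to df
df <- df %>%
  group_by(city) %>%
  mutate(avgtemp = mean(temp)) %>%
  ungroup()
```

After this, `df` has three columns: `city`, `temp` and `avgtemp`.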

On a side note, I do not think 50,000 rows is that big a data set for dplyr. I would not worry too much unless this code is going to run inside some kind of loop or you have 1M+ rows. As Heroka suggested in the comments, data.table is a better alternative when it comes to performance in most cases.
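
For comparison, a rough data.table equivalent is sketched below. It assumes the data.table package is installed and reuses the toy data from the dplyr example; the column name `avgtemp` comes from the question.

```r
library(data.table)

# Same toy data as in the dplyr example
city <- c("a", "a", "b", "b", "c")
temp <- 1:5
dt <- data.table(city = city, temp = temp)

# := adds the column by reference (no copy of dt is made);
# by = city computes the mean separately within each city
dt[, avgtemp := mean(temp), by = city]
```

Since `:=` modifies `dt` in place, no reassignment is needed.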

Edit: removed unnecessary step

leosz
  • This is good, but how do I add the average to data frame? – user3725021 Dec 08 '15 at 12:20
  • The `mutate(mean(temp))` part is the one that creates the additional mean column. I am not sure if I understand what you really mean. The original data frame is not affected because there was no assignment. The code example is just printing the result. `df <- df %>% group_by(city) %>% mutate(mean(temp))` – leosz Dec 08 '15 at 13:05
  • If I want to append this column to the original data frame, how would I do that? – user3725021 Dec 08 '15 at 19:30
  • `df <- df %>% group_by(city) %>% mutate(mean(temp))` – leosz Dec 09 '15 at 07:54