-1

Hey Ive just started learning R. I wanted to ask how i can find the maximum value in a vector based for combination of 2 vectors in a dataframe. For instance, I want to find the five maximum avg price . I have a data frame with about 7 columns, 3 of them being average price, city and year. The year is obviously recurrent and ranges from 2000 to 2009. The data also has various NA's in different columns and rows.

The Data Frame looks something like this:
avgprice    year    city
12          2000    Jersey City
45          2001    Hoboken   
NA          2000    Hoboken
34          2000    evanston
67          2001    abilene
89          2001    elizabeth
45          2000    delhi       
98          2000    delhi

I want the answer to look like
year    city           avgprice
2000    jersey city     12
2000    evanston        34
2000    delhi           98
2001    hoboken         45
2001    elizabeth       89
2001    abilene         67

I tried with the following code but I think I'm messing up something. I tried to create df using split by year and then used an aggregate function.

df<-split(tx.house.sales, tx.house.sales$year)
re<-aggregate(avgprice~city, df, FUN=max)

Thank you :)

chandini v
  • 11
  • 4

2 Answers2

4

Here are some simple options (assuming df is the original data (not splitted)). The reason that aggregate works smoothly is because it has na.action = na.omit as default, which suites your exact needs.

aggregate(avgprice ~ city + year, df, max)
#          city year avgprice
# 1       delhi 2000       98
# 2    evanston 2000       34
# 3 Jersey City 2000       12
# 4     abilene 2001       67
# 5   elizabeth 2001       89
# 6     Hoboken 2001       45

Or

library(data.table)
setDT(na.omit(df))[, .(avgprice = max(avgprice)), keyby = .(year, city)]

Or

library(dplyr)
na.omit(df) %>%
  group_by(city, year) %>%
  summarise(avgprice = max(avgprice))
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • thank you so much.. Also can you suggest what could be wrong in the following code: library(dplyr) df %>% group_by(city,year) %>% do(.[which.max(.$avgprice),]) – chandini v Feb 24 '15 at 22:56
1

Using dplyr you could do:

library(dplyr)
df %>% group_by(city,year) %>% slice(which.max(avgprice))

#  avgprice year       city
#1       67 2001    abilene
#2       98 2000      delhi
#3       89 2001  elizabeth
#4       34 2000   evanston
#5       45 2001    Hoboken
#6       12 2000 JerseyCity

Updated following @docendo discimus's advice

NicE
  • 21,165
  • 3
  • 51
  • 68
  • You could use `df %>% group_by(city,year) %>% slice(which.max(avgprice))` for better performance – talat Feb 24 '15 at 22:49