14

I have a data.frame (say "df") that looks like the following:

Hospital.Name | State | Mortality.Rate
'hospital_1'   | 'AA'  | 0.2
'hospital_2'   | 'AA'  | 0.3
'hospital_3'   | 'BB'  | 0.3
'hospital_4'   | 'CC'  | 0.5

(The Hospital.Name is unique)

Now I want to order "Mortality.Rate" grouped by "State", i.e. order the rates within each state. If there is a tie in the rate, "Hospital.Name" should be used to break the tie.

The "order()" and "tapply()" functions came to my mind. I coded like this:

tapply(df$Mortality.Rate, df$State, order, df$Hospital.Name, na.last=NA)

However, the error "argument lengths differ" popped up. When order() is applied to each per-state slice of the rates, its second argument (i.e. df$Hospital.Name) is not sliced, so the two vectors have different lengths.
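A minimal sketch that reproduces the problem, assuming the four-row table above is built with data.frame(); the variable name msg is only for illustration. It captures the error message rather than stopping, so the mismatch is easy to inspect.

```r
# Sample data copied from the table in the question
df <- data.frame(
  Hospital.Name  = c("hospital_1", "hospital_2", "hospital_3", "hospital_4"),
  State          = c("AA", "AA", "BB", "CC"),
  Mortality.Rate = c(0.2, 0.3, 0.3, 0.5),
  stringsAsFactors = FALSE
)

# tapply() splits only its first argument by State; df$Hospital.Name is
# forwarded to order() whole (length 4), while the "AA" slice has length 2.
msg <- tryCatch(
  tapply(df$Mortality.Rate, df$State, order, df$Hospital.Name, na.last = NA),
  error = function(e) conditionMessage(e)
)
msg
```

The extra arguments after the function in tapply() are passed through unchanged to every call, which is why the tie-breaker never gets subset by state.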

How can I pass the second argument (to resolve ties in the ordering) to tapply(), or are there other approaches?

Zelong

6 Answers

15

In base R, you can supply multiple arguments to order() and subsequent arguments are used to break ties in the earlier variables, as in:

df[order(df$State,df$Mortality.Rate,df$Hospital.Name),]
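As a rough illustration of the tie-breaking, here is a sketch on made-up data (the rows are hypothetical and deliberately shuffled, with a tie at 0.3 in state "AA"). It also shows one possible way to get per-state row indices with tapply() by splitting row numbers instead of the rates; the names sorted and idx_by_state are assumptions for this example.

```r
# Hypothetical data: two hospitals in "AA" share the rate 0.3
df <- data.frame(
  Hospital.Name  = c("hospital_4", "hospital_2", "hospital_1", "hospital_3"),
  State          = c("BB", "AA", "AA", "AA"),
  Mortality.Rate = c(0.5, 0.3, 0.3, 0.2),
  stringsAsFactors = FALSE
)

# Sort by state, then rate, then name (later keys only break ties)
sorted <- df[order(df$State, df$Mortality.Rate, df$Hospital.Name), ]
sorted

# Per-state alternative: split the row numbers, then order within each slice
# so that both sort keys are subset with the same indices
idx_by_state <- tapply(
  seq_len(nrow(df)), df$State,
  function(i) i[order(df$Mortality.Rate[i], df$Hospital.Name[i])]
)
idx_by_state
```

Splitting row numbers keeps every sort key the same length inside each group, which avoids the length mismatch described in the question.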
Jthorpe
10

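You can do it in dplyr. In current versions of dplyr, arrange() ignores grouping unless .by_group = TRUE is set:

library(dplyr)
df %>% group_by(State) %>% arrange(Mortality.Rate, Hospital.Name, .by_group = TRUE)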
jalapic
  • Thanks a lot. But I need to stick to base R when finding the resolution (sorry about not mentioning it in my question). I will have a look at this package. Thanks. – Zelong Feb 21 '15 at 19:24
4

You can do this in dplyr. First, some sample data:

library("dplyr")
hospital_name <- sample(c("hospital_1", "hospital_2", "hospital_3"), 10,
                        replace = TRUE)
state <- sample(letters[1:3], 10, replace = TRUE)
mortality_rate <- runif(10)

df <- tibble(hospital_name, state, mortality_rate)  # data_frame() is deprecated in favor of tibble()

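Group by state, then arrange by columns. In current versions of dplyr, arrange() sorts within groups only when .by_group = TRUE is set.

df %>% 
  group_by(state) %>% 
  arrange(mortality_rate, hospital_name, .by_group = TRUE)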

This produces results like these, where the states are grouped and the mortality rate is sorted within each state.

## Source: local data frame [10 x 3]
## Groups: state
## 
##    hospital_name state mortality_rate
## 1     hospital_1     b     0.15293591
## 2     hospital_1     b     0.37417167
## 3     hospital_1     b     0.54561856
## 4     hospital_3     c     0.02487033
## 5     hospital_1     c     0.09937557
## 6     hospital_1     c     0.35666087
## 7     hospital_3     c     0.39663460
## 8     hospital_2     c     0.53064144
## 9     hospital_3     c     0.76015632
## 10    hospital_3     c     0.76801890

Without group_by() you just get the mortality rates from least to greatest:

df %>%
  arrange(mortality_rate)

## Source: local data frame [10 x 3]
## 
##    hospital_name state mortality_rate
## 1     hospital_3     c     0.02487033
## 2     hospital_1     c     0.09937557
## 3     hospital_1     b     0.15293591
## 4     hospital_1     c     0.35666087
## 5     hospital_1     b     0.37417167
## 6     hospital_3     c     0.39663460
## 7     hospital_2     c     0.53064144
## 8     hospital_1     b     0.54561856
## 9     hospital_3     c     0.76015632
## 10    hospital_3     c     0.76801890
Lincoln Mullen
  • 2
    Here also the answer is similar to @jalapic's. I don't know whether the group_by is needed here: `arrange(df, State, Hospital.Name, Mortality.Rate)` – akrun Feb 21 '15 at 18:48
  • Yes, the `group_by` is needed to sort within states, rather than within the data frame as a whole. See `?dplyr::group_by`. – Lincoln Mullen Feb 21 '15 at 18:50
  • 1
    Can you show some examples where this will differ? I tried your example with a `set.seed(24)` and got the same output with or without group_by. – akrun Feb 21 '15 at 18:50
  • Edited the answer as you suggest. – Lincoln Mullen Feb 21 '15 at 18:54
  • My code was `arrange(df, state, hospital_name, mortality_rate)` – akrun Feb 21 '15 at 18:54
  • Yes, that will also work for sorting. But using `group_by()` is a better match conceptually for what the question is asking, and permits further analysis, such as taking the top n within a grouping. – Lincoln Mullen Feb 21 '15 at 18:56
  • 1
    I thought using only `arrange` would be faster if the OP needs just to order – akrun Feb 21 '15 at 18:57
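The comparison discussed in these comments can be sketched as follows. This assumes a recent dplyr where arrange() accepts .by_group; the seed is the one mentioned in the comment, and the names grouped and plain are made up for the example.

```r
library(dplyr)

set.seed(24)  # seed from the comment above; any seed should behave the same
df <- tibble(
  hospital_name  = sample(c("hospital_1", "hospital_2", "hospital_3"), 10,
                          replace = TRUE),
  state          = sample(letters[1:3], 10, replace = TRUE),
  mortality_rate = runif(10)
)

# Grouped sort: state is used as the leading key because .by_group = TRUE
grouped <- df %>%
  group_by(state) %>%
  arrange(mortality_rate, hospital_name, .by_group = TRUE) %>%
  ungroup()

# Plain sort with state listed explicitly as the first key
plain <- df %>% arrange(state, mortality_rate, hospital_name)

# Compare the row order of the two results
all.equal(as.data.frame(grouped), as.data.frame(plain))
```

If only the sorted rows are needed, the plain arrange() call is enough; keeping the grouping mainly helps when further per-state steps follow.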
3

If we are already loading packages that are needless for this specific operation, here is one (data.table) that can be useful because it sorts the data by reference (without copying it and without needing <-) using the setorder or setkey functions:

library(data.table)
setorder(setDT(df), State, Mortality.Rate, Hospital.Name)

Alternatively, you could mimic base R syntax and order the data while creating a copy (though with improved speed, because data.table calls its forder under the hood):

setDT(df)[order(State, Mortality.Rate, Hospital.Name)]
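A small sketch contrasting the two forms, assuming data.table is installed and using the sample rows from the question in shuffled order (the object names dt and sorted_copy are only illustrative):

```r
library(data.table)

# Sample rows from the question, deliberately out of order
dt <- data.table(
  Hospital.Name  = c("hospital_4", "hospital_2", "hospital_3", "hospital_1"),
  State          = c("CC", "AA", "BB", "AA"),
  Mortality.Rate = c(0.5, 0.3, 0.3, 0.2)
)

# Copying form: returns a new, sorted data.table and leaves dt untouched
sorted_copy <- dt[order(State, Mortality.Rate, Hospital.Name)]

# By-reference form: reorders dt itself, so no assignment is needed
setorder(dt, State, Mortality.Rate, Hospital.Name)

# Both approaches should end up with the same row order
identical(dt$Hospital.Name, sorted_copy$Hospital.Name)
```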
David Arenburg
1

This came to my mind:

df <- df[with(df, order(State, as.numeric(Mortality.Rate), Hospital.Name)), ]

Check out this post: "How to sort a dataframe by column(s)?"

Michael Kaiser
0

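Assign the output to a variable "result", assuming you also want the average mortality for each state:

library(dplyr)
result <- df %>%
                 group_by(State) %>%                             # one group per state
                 arrange(Mortality.Rate, .by_group = TRUE) %>%   # sort rates within each state
                 summarize(mean_rate = mean(Mortality.Rate))     # average rate per state
View(result)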
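Since the question asks for base R, a per-state average could also be sketched without dplyr. This uses aggregate() on the sample table from the question; the name state_means is made up for the example.

```r
# Sample data copied from the table in the question
df <- data.frame(
  Hospital.Name  = c("hospital_1", "hospital_2", "hospital_3", "hospital_4"),
  State          = c("AA", "AA", "BB", "CC"),
  Mortality.Rate = c(0.2, 0.3, 0.3, 0.5),
  stringsAsFactors = FALSE
)

# One row per state with the mean of Mortality.Rate
state_means <- aggregate(Mortality.Rate ~ State, data = df, FUN = mean)
state_means
```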

Waldi
Emeka