2

I have a table of data, where I've labeled the rows based on a cluster they fall into, as well as calculated the average of the rows column values. I would like to select the median row for each cluster.

For example sake, just looking at one, I would like to use:

    median(as.numeric(as.vector(subset(df,df$cluster == i )$avg))) 

I can see that

> as.numeric(as.vector(subset(df,df$cluster == i )$avg))
 [1] 48.11111111 47.77777778 49.44444444 49.33333333 47.55555556 46.55555556 47.44444444 47.11111111 45.66666667 45.44444444

And yet, the median is

> median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
[1] 47.5

I would like to find the median record, by matching the median returned with the average in the column, but that isn't possible with this return.

I've found some documentation and questions on rounding with the mean function, but that doesn't seem to apply to this unfortunately.

I could also limit the data decimal places, but some records will be too close, that duplicates will be common if rounded to one decimal.

schm1651
  • 25
  • 1
  • 6
  • When the input has an even number of values (like the 10 values you have) then there is not a value directly in the middle. The standard definition of a median (which R implements) [averages the two middle values](https://en.wikipedia.org/wiki/Median#Finite_set_of_numbers) in the case of an even number of inputs. You could `rank` the data, and in the case of an even-length input select either the `n/2` or `n/2 + 1` record. – Gregor Thomas Apr 01 '19 at 18:48

2 Answers2

0

When the input has an even number of values (like the 10 values you have) then there is not a value directly in the middle. The standard definition of a median (which R implements) averages the two middle values in the case of an even number of inputs. You could rank the data, and in the case of an even-length input select either the n/2 or n/2 + 1 record.

So, if your data was x = c(8, 6, 7, 5), the median is 6.5. You seem to want the index of "the median", that is either 2 or 3.

If we assume there are no ties, then we can get these answers with

which(rank(x) == length(x) / 2)
# [1] 2
which(rank(x) == length(x) / 2 + 1)
# [1] 3

If ties are a possibility, then rank's default tie-breaking method will cause you some problems. Have a look at ?rank and figure out which option you'd like to use.

We can, of course, turn this into a little utility function:

median_index = function(x) {
  lx = length(x)
  if (lx %% 2 == 1) {
    return(match(median(x), x))
  }
  which(rank(x, ties.method = "first") == lx/2 + 1)
}
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • I think this approach would work. It is a little more cumbersome than I was hoping for especially without creating extra objects. I get the result I was looking for with as.numeric(as.vector(subset(df,df$cluster == i )$avg))[which(rank(as.numeric(as.vector(subset(df,df$cluster == i )$avg))) == length(as.numeric(as.vector(subset(df,df$cluster == i )$avg))) / 2)] 'Result=' 47.44444444 – schm1651 Apr 02 '19 at 13:35
  • Repeating `as.numeric(as.vector())` also seems cumbersome. Just make `avg` a numeric column in your data frame and you can stop typing the conversions again and again. This will be a line or 2 if you use `dplyr` or `data.table` (or even `aggregate` if you want to stay in `base`) instead of a for loop with `subset`. Have a look at the [Calculate mean by group R-FAQ](https://stackoverflow.com/q/11562656/903061) and use a custom median function instead of `mean`. – Gregor Thomas Apr 02 '19 at 13:51
  • 1
    Found a bug in your function. `which` should be matching against your length(x) / 2 + 1 calculation. As currently written, your else condition will always return the last element of the vector. Should be `which(rank(x, ties.method = "first") == (lx/2+1))` – AcademicDialysis Nov 24 '21 at 21:35
-1

There is an easier way to do that: use dplyr

library(dplyr)

df%>%
group_by(cluster)%>% 
summarise(Median=median(avg))
Henrique Florencio
  • 3,440
  • 1
  • 18
  • 19
  • This is a great option, which provides a much easier approach to the problem, but ultimately doesn't solve my issue. I want to select/find/label the row which has the median avg. if there is an even number, I'd like it to select the first one. Your code suggestion yields: cluster Median 1 1 47.5 2 2 53.9 3 3 43.4 4 4 60.4 This is great, but I still can't match the median of the cluster to a specific row (i.e.47.5 won't match to 47.4444 which seems to be the median value. – schm1651 Apr 02 '19 at 13:25