I have thousands of datasets like this:

>student1
    quantities score
[1]          4    10         
[2]          1    12         
[3]         78     5         
[4]          6   294

I'd like to calculate the median of the scores for this student. Each score occurs a given number of times (its quantity). In this case, I want it to return 5, since the median falls among the 78 copies of 5.

I've looked at some posts here, such as "how to calculate the median on grouped dataset?", but I can't use that approach because I have thousands of datasets.

I've also tried installing the aroma.light and matrixStats packages, but I still can't use the weighted median function. R tells me:

Error: could not find function "weightedMedians"

OK, the above is just an example. My real dataset looks like this:

>test
     [,1]          [,2]
info    3            10
info    2            20
        4      86779637
        1        135777
        7          2342

But when I try to use

>rep(test[, 1], test[, 2])

I get

Error in rep(test[, 1], test[, 2]) : invalid 'times' argument
In addition: Warning message:
NAs introduced by coercion 

What can I do now?

Natalia

2 Answers

You can just use:

median(rep(student1$score, student1$quantities))

This is relatively fast (it takes only a few seconds with a simulated dataset of 100k rows).
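
If the quantities are very large, expanding the vector with rep() may use a lot of memory. As a rough sketch (not part of the answer above), the same median can be found from cumulative counts instead. The helper name grouped_median is made up for illustration, and the code assumes non-negative quantities with at least one positive value:

```r
# Hypothetical helper: median of `score` where each value is repeated
# `quantities` times, computed without building the expanded vector.
grouped_median <- function(score, quantities) {
  ord <- order(score)
  score <- score[ord]
  # as.numeric() avoids integer overflow when the counts are large
  cum <- cumsum(as.numeric(quantities[ord]))
  n <- cum[length(cum)]
  # Positions of the middle element(s) in the expanded, sorted vector
  lo <- floor((n + 1) / 2)
  hi <- ceiling((n + 1) / 2)
  # First group whose cumulative count reaches each position
  v_lo <- score[which(cum >= lo)[1]]
  v_hi <- score[which(cum >= hi)[1]]
  (v_lo + v_hi) / 2
}

# The example data from the question
student1 <- data.frame(quantities = c(4, 1, 78, 6),
                       score = c(10, 12, 5, 294))
grouped_median(student1$score, student1$quantities)
```

For the example data this should agree with median(rep(student1$score, student1$quantities)). With an even total count it averages the two middle values, as median() does.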

nico

The function for calculating the weighted median in the matrixStats package is called weightedMedian() (without a plural 's'), e.g.

> library("matrixStats")
matrixStats v0.14.0 (2015-02-13) successfully loaded. See ?matrixStats for help.
> weightedMedian(student1$score, w=student1$quantities)
[1] 5.670732
> weightedMedian(student1$score, w=student1$quantities, interpolate=FALSE)
[1] 5
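
Since the question mentions thousands of datasets, here is a minimal sketch of applying a per-dataset median over a list of data frames. The list name datasets and the student2 values are invented for illustration; only student1 comes from the question. It uses base R's median(rep(...)) so that it runs without extra packages, but the function body could be swapped for a weighted median call:

```r
# Hypothetical list of per-student data frames. student1 is the example
# from the question; student2 is made-up data.
datasets <- list(
  student1 = data.frame(quantities = c(4, 1, 78, 6),
                        score = c(10, 12, 5, 294)),
  student2 = data.frame(quantities = c(2, 3, 1),
                        score = c(7, 9, 20))
)

# One median per dataset; vapply() checks that each result is a single number
medians <- vapply(
  datasets,
  function(d) median(rep(d$score, d$quantities)),
  numeric(1)
)
medians
```

The result is a named numeric vector with one entry per dataset, which is usually easier to work with than calling the function by hand for each one.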
HenrikB