
I am trying to calculate a weighted median in sparklyr, but the weighted.median function in R doesn't seem to be compatible with sparklyr. I tried pulling the data out of Spark with a collect() and computing the weighted median in regular R, but that hangs and then crashes with an out-of-memory error because the data is so large it genuinely requires sparklyr in a distributed Hadoop environment. I tried reducing the number of columns and rows to the bare minimum, but I still can't figure out how to compute a weighted median IN sparklyr without moving the data out of it.

I don't have any code to show because my approach of pulling the data out of Spark into regular R on a single server crashes as soon as the full data set is collected; pulling it from sparklyr into regular R is simply not a viable approach.
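Roughly, the non-scalable version looked like the sketch below (the table and column names `my_table`, `value`, and `weight` are placeholders, and `weighted.median()` is assumed to come from the spatstat package; matrixStats has a similar `weightedMedian()`):

```r
# Minimal sketch of the collect()-based attempt, purely to illustrate why it fails:
# collect() pulls the entire result set into local R memory, which is what
# triggers the out-of-memory crash on data this size.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")

local_df <- tbl(sc, "my_table") %>%
  select(value, weight) %>%
  collect()   # full data moves to the driver / local R session

# weighted.median() only works on an in-memory vector, hence the crash above
spatstat::weighted.median(local_df$value, local_df$weight)
```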

  • Computing a median doesn't scale in the first place. To have a scalable process you need to sample or approximate. Some form of weighting can be incorporated into this process by replicating rows. – zero323 Jul 25 '18 at 00:17
  • I only need a weighted overall population median; we can't sample it. We are trying to recreate in sparklyr what another process produced in SAS. If we produce a weighted median based on a sample, our statistical reviewer will deem it unacceptable. We can't use a simple median because issues with data collection force us to weight the data. Realistically, we need to be able to find the median across the distributed partitions of data. – mega_dan Jul 26 '18 at 13:24
  • Well... If you really need this, you can try to replicate [this process](https://stackoverflow.com/a/31437177) (with the addition of weighting), but it might require your own Scala extension if you want it to scale. – zero323 Jul 26 '18 at 13:28

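One way to act on zero323's suggestion entirely inside sparklyr is sketched below: scale each weight to an integer replication count, expand the rows with `array_repeat()`/`explode()`, and take `percentile_approx()` at 0.5. The table and column names are placeholders, `array_repeat()` requires Spark 2.4+, and the result is an approximate weighted median, not an exact one.

```r
# Hypothetical sketch: weights become integer replication counts, rows are
# expanded inside Spark, and the median is approximated with percentile_approx.
# Assumes Spark 2.4+ (for array_repeat) and placeholder names my_table, value,
# weight; the scaling factor 100 controls how finely the weights are honored.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")

approx_weighted_median <- tbl(sc, "my_table") %>%
  mutate(n_rep = as.integer(round(weight * 100))) %>%            # weight -> whole copies
  filter(n_rep > 0) %>%
  mutate(value_rep = explode(array_repeat(value, n_rep))) %>%    # replicate rows inside Spark
  summarise(wmedian = percentile_approx(value_rep, 0.5)) %>%     # approximate median
  collect()

approx_weighted_median
```

Whether the percentile_approx accuracy is acceptable to the statistical reviewer is the open question; an exact weighted median would still need something like the linked process (or a custom Scala extension), as zero323 notes.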