I have a data frame like this:

col1     col2
 100      3
 200      2
 300      4
 400      1

I want the median of col1, with the col2 values acting as weights for the corresponding col1 values, like this:

median of [100, 100, 100, 200, 200, 300, 300, 300, 300, 400] # 100 appears 3 times because its weight is 3

I could do it by expanding each row into multiple rows based on the weights, but I can't afford the extra rows. Is there a more efficient way to do this without creating the extra rows, in either Python or PySpark?

Kallol

1 Answer

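Repeat each col1 value according to its weight in col2, then calculate the median. This builds the repeated values as a temporary Series and leaves df itself unchanged: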

df.loc[df.index.repeat(df['col2']), 'col1'].median()

250.0
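
If even the temporary repeated Series is too large, the same number can be obtained from cumulative weights without expanding anything. The following is only a sketch, assuming the weights are non-negative integers with a positive total; `weighted_median` is a hypothetical helper name, not a pandas or NumPy function. It sorts the values, takes the running sum of the weights, and looks up which value would sit at the one or two middle positions of the expanded list.

```python
import numpy as np
import pandas as pd


def weighted_median(values, weights):
    """Median of `values` as if each were repeated `weights` times.

    Hypothetical helper; assumes non-negative integer weights whose
    total is greater than zero.
    """
    values = np.asarray(values)
    weights = np.asarray(weights)

    # Sort values and carry their weights along.
    order = np.argsort(values, kind="stable")
    values = values[order]
    cum = np.cumsum(weights[order])  # cum[i] = how many expanded items are <= values[i]
    n = cum[-1]                      # length of the (never built) expanded list

    # 0-based positions of the middle element(s) in the expanded list.
    # Position k belongs to the first value whose cumulative weight exceeds k.
    lo = np.searchsorted(cum, (n - 1) // 2, side="right")
    hi = np.searchsorted(cum, n // 2, side="right")
    return (values[lo] + values[hi]) / 2


df = pd.DataFrame({"col1": [100, 200, 300, 400], "col2": [3, 2, 4, 1]})
result = weighted_median(df["col1"], df["col2"])
print(result)  # 250.0
```

For the sample data the cumulative weights are [3, 5, 9, 10], so positions 4 and 5 of the expanded list fall on 200 and 300, which gives 250.0, the same as the repeat-based result. Memory use stays proportional to the number of rows in df rather than to the sum of the weights.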
Shubham Sharma