2

Using a multilevel dataframe showing the price and the factor for products alfa - delta, Im trying create a new dataseries with the average price of the two products with highest Factor. Eg, average price of alfa and bravo if alfa and bravo have the highest factors.

import pandas as pd
import numpy as np
​
index = [np.array(['price', 'price', 'price', 'price', 'factor', 'factor', 'factor', 'factor']),
         np.array(['alfa', 'bravo', 'charlie', 'delta', 'alfa', 'bravo', 'charlie', 'delta'])]
df = pd.DataFrame(np.random.randn(3, 8), index=['2014', '2015', '2016'], columns=index)
​
df

Out[1]:
                                            price                              factor
             alfa   bravo        charlie    delta       alfa        bravo        charlie    delta
2014    -1.078024   -2.370577   1.809694    0.937910    0.643634    -1.167022   -0.013712   0.026595
2015    -0.374975   1.459360    0.875787    -1.407601   -1.220319   0.604929    0.414953    0.053431
2016    -0.265826   1.261522    0.839443    -0.144880   0.157955    -1.050584   -0.909444   0.687804
Mazdak
  • 105,000
  • 18
  • 159
  • 188
NRVA
  • 507
  • 3
  • 20
  • Each product has 3 factor how do you define the highest factor? Can you add your expected output (maybe with a simpler and minimal example) – Mazdak May 09 '17 at 11:19
  • row by row. Im after a new dataseries with the same index (2014, 2015 and 2016), but where the values are the average price of the two with the highest factor. so for 2014 where alfa and delta have the highest factor, we average -1.078024 and 0.93791) – NRVA May 09 '17 at 11:27

1 Answers1

0

You can use:


np.random.seed(123)
np.random.seed(123)
index = [['price'] * 4 + ['factor'] * 4, ['alfa','bravo','charlie','delta'] * 2]
df = pd.DataFrame(np.random.rand(3,8), index=['2014', '2015', '2016'], columns=index)
#print (df)

dff = df.xs('factor', axis=1, level=0)
print (dff)
          alfa     bravo   charlie     delta
2014  0.719469  0.423106  0.980764  0.684830
2015  0.438572  0.059678  0.398044  0.737995
2016  0.634401  0.849432  0.724455  0.611024

a = (np.argsort(-dff.values, axis=1)[:, :2])
print (a)
[[2 0]
 [3 0]
 [1 2]]

#check columns with highest values
print (dff.columns[a])
Index([['charlie', 'alfa'], ['delta', 'alfa'], ['bravo', 'charlie']], dtype='object')
dfp = df.xs('price', axis=1, level=0)
print (dfp)
          alfa     bravo   charlie     delta
2014  0.696469  0.286139  0.226851  0.551315
2015  0.480932  0.392118  0.343178  0.729050
2016  0.182492  0.175452  0.531551  0.531828

b = dfp.values[np.arange(len(df.index))[:,None], a][:,:2]
print (b)
[[ 0.22685145  0.69646919]
 [ 0.72904971  0.4809319 ]
 [ 0.17545176  0.53155137]]

c = pd.Series(np.mean(b, axis=1), index=df.index)
print (c)
2014    0.461660
2015    0.604991
2016    0.353502
dtype: float64
Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252