Get data from one column based on criteria from another in multilevel dataframe

Question

Using a multilevel dataframe showing the price and the factor for products alfa - delta, Im trying create a new dataseries with the average price of the two products with highest Factor. Eg, average price of alfa and bravo if alfa and bravo have the highest factors.

import pandas as pd
import numpy as np

index = [np.array(['price', 'price', 'price', 'price', 'factor', 'factor', 'factor', 'factor']),
         np.array(['alfa', 'bravo', 'charlie', 'delta', 'alfa', 'bravo', 'charlie', 'delta'])]
df = pd.DataFrame(np.random.randn(3, 8), index=['2014', '2015', '2016'], columns=index)

df

Out[1]:
                                            price                              factor
             alfa   bravo        charlie    delta       alfa        bravo        charlie    delta
2014    -1.078024   -2.370577   1.809694    0.937910    0.643634    -1.167022   -0.013712   0.026595
2015    -0.374975   1.459360    0.875787    -1.407601   -1.220319   0.604929    0.414953    0.053431
2016    -0.265826   1.261522    0.839443    -0.144880   0.157955    -1.050584   -0.909444   0.687804

Each product has 3 factor how do you define the highest factor? Can you add your expected output (maybe with a simpler and minimal example) — Mazdak, May 09 '17 at 11:19
row by row. Im after a new dataseries with the same index (2014, 2015 and 2016), but where the values are the average price of the two with the highest factor. so for 2014 where alfa and delta have the highest factor, we average -1.078024 and 0.93791) — NRVA, May 09 '17 at 11:27

score 0 · Accepted Answer · edited May 23 '17 at 12:34

You can use:

xs for select DataFrame by levels
numpy.argsortfor indexes of highest values
values by indices
numpy.mean with Series.

np.random.seed(123)
np.random.seed(123)
index = [['price'] * 4 + ['factor'] * 4, ['alfa','bravo','charlie','delta'] * 2]
df = pd.DataFrame(np.random.rand(3,8), index=['2014', '2015', '2016'], columns=index)
#print (df)

dff = df.xs('factor', axis=1, level=0)
print (dff)
          alfa     bravo   charlie     delta
2014  0.719469  0.423106  0.980764  0.684830
2015  0.438572  0.059678  0.398044  0.737995
2016  0.634401  0.849432  0.724455  0.611024

a = (np.argsort(-dff.values, axis=1)[:, :2])
print (a)
[[2 0]
 [3 0]
 [1 2]]

#check columns with highest values
print (dff.columns[a])
Index([['charlie', 'alfa'], ['delta', 'alfa'], ['bravo', 'charlie']], dtype='object')

dfp = df.xs('price', axis=1, level=0)
print (dfp)
          alfa     bravo   charlie     delta
2014  0.696469  0.286139  0.226851  0.551315
2015  0.480932  0.392118  0.343178  0.729050
2016  0.182492  0.175452  0.531551  0.531828

b = dfp.values[np.arange(len(df.index))[:,None], a][:,:2]
print (b)
[[ 0.22685145  0.69646919]
 [ 0.72904971  0.4809319 ]
 [ 0.17545176  0.53155137]]

c = pd.Series(np.mean(b, axis=1), index=df.index)
print (c)
2014    0.461660
2015    0.604991
2016    0.353502
dtype: float64

Get data from one column based on criteria from another in multilevel dataframe

1 Answers1