2

I would like to add a new level to my DF (so that then I can use pd.reindex to do something else). My DF basically has something like this:

df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
                   ('A','b'): [0,1,2,3,-1],
                   ('A','c'): [-1,1,0,10,12],
                   ('A','d'): [1,1,2,3,-1],
                   ('B','a'): [-20,-10,0,10,20],
                   ('B','b'): [-200,-100,0,-1,200],
                   ('B','c'): [-20,-10,0,10,20],
                   ('B','d'): [-200,-100,0,100,200]
})

##df
    A               B
    a   b   c   d   a   b     c     d
0   -1  0   1   1   -20 -200  -20   -200
1   -1  1   -1  1   -10 -100  -10   -100
2   0   2   0   2   0   0     0     0
3   10  3   10  3   10  -1    10    100
4   12  -1  12  -1  20  200   20    200

I want to assign new level keys L1 = a+b, and L2=c+d. How do I do this?

The desired output would be

##df
    A               B
    L1      L2      L1        L2
    a   b   c   d   a   b     c     d
0   -1  0   1   1   -20 -200  -20   -200
1   -1  1   -1  1   -10 -100  -10   -100
2   0   2   0   2   0   0     0     0
3   10  3   10  3   10  -1    10    100
4   12  -1  12  -1  20  200   20    200

Edit: the objective is to achieve something similar to what was asked in here. This means that some rows will have NAs for the same KEY, depending on the value of other columns. E.g. if I want to filter columns a and c by respectively testing whether columns b and d are negative:

##df
    A               B
    L1      L2      L1        L2
    a   b   c   d   a   b     c     d
0   -1  0   1   1   NA  NA    NA    NA
1   -1  1   -1  1   NA  NA    NA    NA
2   0   2   0   2   0   0     0     0
3   10  3   10  3   NA  NA    10    100
4   NA  NA  NA  NA  20  200   20    200
Community
  • 1
  • 1
Sos
  • 1,783
  • 2
  • 20
  • 46
  • "so that then I can use pd.reindex to do something else": perhaps we can help you better if you state what is your final purpose. – hyamanieu Feb 27 '17 at 13:35
  • Well, the final purpose is very similar to what I had asked in here (http://stackoverflow.com/questions/42246373/replacing-values-in-a-2nd-level-column-on-multiindex-df-in-pandas). The difference now is that I have to do the parsing of columns `a` and `b` (as I did in the previous post), and the same for `c` and `d`. – Sos Feb 27 '17 at 13:40
  • I'm going to change the working example to reflect this comment – Sos Feb 27 '17 at 13:40
  • @Wli I have added some information which I hope may help in assisting me :) – Sos Feb 27 '17 at 13:47
  • It seems someone could answer it nevertheless :) – hyamanieu Feb 27 '17 at 13:59

1 Answers1

3

You need create new array with map and then assign:

d = {'a':'L1','b':'L1','c':'L2','d':'L2'}
a = df.columns.get_level_values(1).map(lambda x: d[x])
print (a)
['L1' 'L1' 'L2' 'L2' 'L1' 'L1' 'L2' 'L2']

df.columns = [df.columns.get_level_values(0),a,df.columns.get_level_values(1)]
#same as
df.columns = pd.MultiIndex.from_arrays([df.columns.get_level_values(0),
                                        df.columns.get_level_values(1).map(lambda x: d[x]),
                                        df.columns.get_level_values(1)])
print (df)
    A             B              
   L1     L2     L1       L2     
    a  b   c  d   a    b   c    d
0  -1  0  -1  1 -20 -200 -20 -200
1  -1  1   1  1 -10 -100 -10 -100
2   0  2   0  2   0    0   0    0
3  10  3  10  3  10   -1  10  100
4  12 -1  12 -1  20  200  20  200

Second output is really complicated, for me works:

#filter columns
idx = pd.IndexSlice
mask = df.loc[:, idx[:,:,['b','d']]] < 0
print (mask)
       A             B       
      L1     L2     L1     L2
       b      d      b      d
0  False  False   True   True
1  False  False   True   True
2  False  False  False  False
3  False  False   True  False
4   True   True  False  False

#create mask to columns a,c
mask1 = mask.reindex(columns=df.columns)
mask1 = mask1.groupby(level=[0,1], axis=1).apply(lambda x: x.bfill(axis=1))
print (mask1)
       A                           B                     
      L1            L2            L1            L2       
       a      b      c      d      a      b      c      d
0  False  False  False  False   True   True   True   True
1  False  False  False  False   True   True   True   True
2  False  False  False  False  False  False  False  False
3  False  False  False  False   True   True  False  False
4   True   True   True   True  False  False  False  False

print (df.mask(mask1))
      A                     B                    
     L1         L2         L1           L2       
      a    b     c    d     a      b     c      d
0  -1.0  0.0  -1.0  1.0   NaN    NaN   NaN    NaN
1  -1.0  1.0   1.0  1.0   NaN    NaN   NaN    NaN
2   0.0  2.0   0.0  2.0   0.0    0.0   0.0    0.0
3  10.0  3.0  10.0  3.0   NaN    NaN  10.0  100.0
4   NaN  NaN   NaN  NaN  20.0  200.0  20.0  200.0

Another solution with reindex and method='bfill', but is necessary double transpose (I think it is bug - works only with MultiIndex in index, not with MultiIndex in columns):

idx = pd.IndexSlice
mask = df.loc[:, idx[:,['b','d']]] < 0
print (mask)
       A             B       
       b      d      b      d
0  False  False   True   True
1  False  False   True   True
2  False  False  False  False
3  False  False   True  False
4   True   True  False  False

mask1 = mask.T.reindex(df.columns, method='bfill').T
print (mask1)
       A                           B                     
       a      b      c      d      a      b      c      d
0  False  False  False  False   True   True   True   True
1  False  False  False  False   True   True   True   True
2  False  False  False  False  False  False  False  False
3  False  False  False  False   True   True  False  False
4   True   True   True   True  False  False  False  False

print (df.mask(mask1))
      A                     B                    
      a    b     c    d     a      b     c      d
0  -1.0  0.0  -1.0  1.0   NaN    NaN   NaN    NaN
1  -1.0  1.0   1.0  1.0   NaN    NaN   NaN    NaN
2   0.0  2.0   0.0  2.0   0.0    0.0   0.0    0.0
3  10.0  3.0  10.0  3.0   NaN    NaN  10.0  100.0
4   NaN  NaN   NaN  NaN  20.0  200.0  20.0  200.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I am getting the following error when setting `mask = df.loc[:, idx[:,:,['b','d']]] < 0`: `KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (3), lexsort depth (2)'`. Changing your `mask` to `mask = df.loc[:, idx[:,['b','d']]] < 0` solves the problem with your working example and yields the correct `print`. However, in my real example I also get a similar error `the index to be fully lexsorted tuple len (3), lexsort depth (1)'`, and I am not able to solve it and get the correct print – Sos Feb 28 '17 at 08:55
  • 1
    I think you need `df = df.sort_index(axis=1)` first, check [docs](http://pandas.pydata.org/pandas-docs/stable/advanced.html#sorting-a-multiindex) – jezrael Feb 28 '17 at 09:04
  • Thank you so much @jezrael, I learned a lot with this and your previous answer to my post! I am truly grateful! – Sos Feb 28 '17 at 09:13
  • 1
    MultiIndexes are really hard, this question is not so nice because there is a hack - `groupby` with `bfill`, but i have no idea how do it better. – jezrael Feb 28 '17 at 09:23