1

I am working on the pandas dataset. For 2D dataframe try to return/append one column which return the column name whose value is over 0.95.

import pandas as pd
import numpy as np

Exp_day_list = ["EXP_DAY_1","EXP_DAY_2","EXP_DAY_3","EXP_DAY_4","EXP_DAY_5","EXP_DAY_6","EXP_DAY_7","EXP_DAY_8","EXP_DAY_9","EXP_DAY_10","EXP_GT_DAY_10"]


test = raw_databased.head()
Exp_day_percentage = test[Exp_day_list]


def over_95_percent(x):
    for column in x:
        if x[column] > 0.95:
            return column
            break
Exp_day_percentage.apply(over_95_percent,axis = 1)

I test Exp_day_percentage and result is as what I need.

Exp_day_percentage
Out[2]: 
   EXP_DAY_1  EXP_DAY_2  EXP_DAY_3  EXP_DAY_4  EXP_DAY_5  EXP_DAY_6  \
0        0.0        0.0       0.52       0.94       0.94        1.0   
1        0.0        0.0       0.00       0.66       1.00        1.0   
2        0.0        1.0       1.00       1.00       1.00        1.0   
3        0.0        0.0       0.92       1.00       1.00        1.0   
4        0.0        0.0       0.95       0.97       1.00        1.0   

   EXP_DAY_7  EXP_DAY_8  EXP_DAY_9  EXP_DAY_10  EXP_GT_DAY_10  
0        1.0        1.0        1.0         1.0            0.0  
1        1.0        1.0        1.0         1.0            0.0  
2        1.0        1.0        1.0         1.0            0.0  
3        1.0        1.0        1.0         1.0            0.0  
4        1.0        1.0        1.0         1.0            0.0  

but when I run the apply function to that dataframe, error function as following:

TypeError: ("cannot do label indexing on <class 'pandas.indexes.base.Index'>  
with these indexers [0.0] of <type 'numpy.float64'>", u'occurred at index 0')

ideal result will be following:

   EXP_DAY_1  EXP_DAY_2  EXP_DAY_3  EXP_DAY_4  EXP_DAY_5  EXP_DAY_6  \
0        0.0        0.0       0.52       0.94       0.94        1.0   
1        0.0        0.0       0.00       0.66       1.00        1.0   
2        0.0        1.0       1.00       1.00       1.00        1.0   
3        0.0        0.0       0.92       1.00       1.00        1.0   
4        0.0        0.0       0.95       0.97       1.00        1.0   

   EXP_DAY_7  EXP_DAY_8  EXP_DAY_9  EXP_DAY_10  EXP_GT_DAY_10  Column
0        1.0        1.0        1.0         1.0            0.0  EXP_DAY_5
1        1.0        1.0        1.0         1.0            0.0  EXP_DAY_5
2        1.0        1.0        1.0         1.0            0.0  EXP_DAY_2
3        1.0        1.0        1.0         1.0            0.0  EXP_DAY_4
4        1.0        1.0        1.0         1.0            0.0  EXP_DAY_3

if anyone can help me on that, I would much appreciate that. I search all internet and could not find similar thing. thank you

cs95
  • 379,657
  • 97
  • 704
  • 746
Michael Li
  • 647
  • 2
  • 8
  • 20

1 Answers1

1

Use pd.DataFrame.idxmax

df.assign(Column=df.gt(.95).assign(zip5=1).idxmax(1))

   EXP_DAY_1  EXP_DAY_2  EXP_DAY_3  EXP_DAY_4  EXP_DAY_5  EXP_DAY_6  EXP_DAY_7  EXP_DAY_8  EXP_DAY_9  EXP_DAY_10  EXP_GT_DAY_10     Column
0        0.0        0.0       0.52       0.94       0.94        1.0        1.0        1.0        1.0         1.0            0.0  EXP_DAY_6
1        0.0        0.0       0.00       0.66       1.00        1.0        1.0        1.0        1.0         1.0            0.0  EXP_DAY_5
2        0.0        1.0       1.00       1.00       1.00        1.0        1.0        1.0        1.0         1.0            0.0  EXP_DAY_2
3        0.0        0.0       0.92       1.00       1.00        1.0        1.0        1.0        1.0         1.0            0.0  EXP_DAY_4
4        0.0        0.0       0.95       0.97       1.00        1.0        1.0        1.0        1.0         1.0            0.0  EXP_DAY_4
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • thank you so much for your help. how do you know gt will work in this case? I looked at the document before from pandas but I had no idea where to find the related function? In addition, can you tell me why my code does not work? – Michael Li Nov 07 '17 at 03:54
  • `pd.DataFrame.gt` is the greater than method. It is synonymous with `df > .95` or the full thing `(df > .95).idxmax(1)` – piRSquared Nov 07 '17 at 03:57
  • Hi, piRSquared. Questions on this topic. I found if gt return all false, idxmax will return all first column. How can I replace that with certain constant value list NA. (returning first column will make me to clean data again. – Michael Li Nov 13 '17 at 18:57
  • What do you want it to return in that circumstance? – piRSquared Nov 13 '17 at 19:00
  • best would be refer to another column in that table. For example, I have zip5 - EXP_DAY_1 EXP_DAY_2 EXP_DAY_3 EXP_DAY_4 EXP_DAY_5 EXP_DAY_6 EXP_DAY_7 EXP_DAY_8 EXP_DAY_9 EXP_DAY_10 EXP_GT_DAY_10 . as list. If all False, return ZIP5. Thank you – Michael Li Nov 14 '17 at 02:07