0

Given three dataframes, one contains user data, the second one contains data binning and the third are category names as in:

klasses_df = pd.DataFrame([[1, 'Sad'],
                           [7, 'Regular'],
                           [13, 'Happy'],
                           [42, 'Magical']],
                           columns=['klass', 'mood'])
                           
bins_df = pd.DataFrame([[0.0, 3.0, 1],
                        [3.0, 6.0, 7],
                        [6.0, 8.0, 13]],
                       columns=['lower', 'upper', 'klass'])


person_df = pd.DataFrame([['John', 1.5],
                          ['Mary', 3.6],
                          ['Paul', 7.2],
                          ['Josh', 5.7],
                          ['Phil', 9.9]],
                         columns=['name', 'feeling'])

I would like to extend the person_df (or create a new dataframe) where the correct klass_id and mood can be found. For example in the first row of person_df, John's feelings are at 1.5, checking in bins_df we can see that is in range first range [0, 3] hence at klass 1. Looking klasses_df we find that klass_id 1 is Sad. This will make the final/new row related to Jonh as John, 1.5, 1, 'Sad'.

To achieve that I have created two auxiliary funcions:

def find_klass_from_feeling(feeling, bin_data):
    values = bin_data.values
    klass = values[(values[:,0] <= feeling) & (feeling < values[:,1])][:,2]
    if len(klass) == 0:
        return 0
    else:
        return int(klass.flatten()[0])

def find_mood_from_class(klass, klasses_data):
    if klass == 0:
        return None
    retval = klasses_df[klasses_df['klass'] == klass]['mood'].iloc[0]
    return retval

And I call them as:


final_df = person_df.copy()
klss = []
moods = []
for idx, row in person_df.iterrows():
    kls = find_klass_from_feeling(row['feeling'], bins_df)
    mood = find_mood_from_class(kls, klasses_df)
    klss.append(kls)
    moods.append(mood)
    
final_df['klass'] = klss
final_df['mood'] = moods

It works but seems completely wrong, since I believe, pandas has some more proper way to handle it. I tried to use apply and applymap without success.

Any hints are welcome.

Lin
  • 1,145
  • 11
  • 28
  • Look into "non-equi joins" such as discussed here: https://stackoverflow.com/questions/50573457/pandas-equivalent-of-sql-non-equi-join – teepee Nov 18 '20 at 22:49
  • can you rewrite the binning to different format, or does it have to be this data frame (i.e. this is more manageable if bins can be in a list)? also, what happens to Phil, who's score is outside of bin range? – yulGM Nov 19 '20 at 02:00
  • also, your bins have same value at start and end. if a person lands on 3, for example, should they belong to bin 1 or 7? – yulGM Nov 19 '20 at 02:02
  • `lower <= value < upper` solver. It needs the same values since there is no way to represent the limit. – Lin Nov 20 '20 at 15:41

1 Answers1

1

first there are couple of issues with the question:

  • does the range include right value or left? For now I assume that no, and that value 3 would go to bin 1.
  • what happens to values over 8? For now I create new category '99' for all values over 8.

to solve your problem, you can use pandas pd.cut()

You need to have your bins and labels as list:

you can declare it manually:

bins = [0.0, 3.0, 6.0, 8.0, float("inf")]
labels = [1, 7, 13, 99]

or if your data comes from outside, you can convert it:

bins = [0.0]+bins_df['upper'].to_list()+[float("inf")]
label = bins_df['klass'].to_list() + [99]

(I use 99 as a dummy value, to indicate missing class for values over 8)

Now use pd.cut() to bin person_df['feelings'] to class:

person_df['klass']=pd.cut(person_df['feeling'], bins = bins, labels=labels)

And then link it to the mood to get the value:

result = pd.merge(person_df, klasses_df, on='klass', how='left')
print(result)

output:

   name  feeling  klass     mood
0  John      1.5      1      Sad
1  Mary      3.6      7  Regular
2  Paul      7.2     13    Happy
3  Josh      5.7      7  Regular
4  Phil      9.9     99      NaN

(Phil is outside specified range, and therefore has no mood)

yulGM
  • 894
  • 1
  • 5
  • 14