I have a data frame with columns (Name, Sum, a, b) and I want to create a column named "mean" that holds the mean of columns a and b. If two rows have the same mean, the row whose Sum value is larger should have its mean decreased by 0.1.

   data frame 1

  Name  Sum  a   b      mean
0 hamm   34  2   2       2
1 jam    54  1   1  -->  1
2 tan    36  3   1       2
3 pan    39  4   4       4

As we can see, rows 0 and 2 have the same mean value, so the row whose Sum value is larger should have its mean decreased by 0.1.

In this case that is row 2, whose mean should become 2 - 0.1 = 1.9.

Final Result

  Name  Sum  a   b   mean
0 hamm   34  2   2    2
1 jam    54  1   1    1
2 tan    36  3   1    1.9
3 pan    39  4   4    4
martineau
Amit
  • What problem are you trying to solve by doing this? I can't think of a reason why it would make any mathematical sense. – Karl Knechtel Apr 30 '20 at 07:32
  • It definitely makes sense. Here the "a" and "b" columns are feature ranks that I got from different ML models, and I want to take their mean to see which features rank well overall. If the ranks tie, I want to apply the condition I specified, so that the one with the greater sum shows up. @KarlKnechtel – Amit Apr 30 '20 at 07:40
  • What happens if there are 3 rows with the same mean? – Alexandre B. Apr 30 '20 at 08:13
  • That's why I want to write generic code, so that it can handle situations like this. @AlexandreB. – Amit Apr 30 '20 at 09:16
  • So what is the answer to my question? – Alexandre B. Apr 30 '20 at 09:19
  • Then the one with the greatest sum will be (mean - 0.2), the second one will be (mean - 0.1), and the last one remains unchanged. @AlexandreB. – Amit Apr 30 '20 at 12:08

1 Answer

You can try mean and cumcount:

# The second assign uses a lambda so it sees the "mean" column created by the first one
df.assign(mean=df[["a", "b"]].mean(axis=1))\
  .assign(mean=lambda d: d["mean"].subtract(d.groupby("mean").cumcount().divide(10)))

output

#    Name  Sum  a  b  mean
# 0  hamm   34  2  2   2.0
# 1   jam   54  1  1   1.0
# 2   tan   36  3  1   1.9
# 3   pan   39  4  4   4.0

Explanations:

  1. Compute the mean using mean. We specify axis=1 to compute it across the columns of each row.

  2. For the n-th repeated occurrence of a mean, we want to subtract n*0.1.

    1. We use groupby to group all rows with the same mean.
    2. We number the rows within each group (0, 1, 2, ...) using cumcount. See this discussion for more details.
    3. We divide by 10 using divide in order to convert the counter to 0.0, 0.1, 0.2, ...
  3. Subtract the output of step 2 from the mean column using subtract.
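
To illustrate step 2 on a tie of more than two rows (the case raised in the comments), here is a minimal sketch. The `means` series is hypothetical data made up for the illustration, not taken from the question.

```python
import pandas as pd

# Hypothetical data: three rows share the same mean (2.0).
means = pd.Series([2.0, 1.0, 2.0, 2.0], name="mean")

# cumcount numbers the rows of each group in order of appearance: 0, 1, 2, ...
position = means.groupby(means).cumcount()
print(position.tolist())  # [0, 0, 1, 2]

# Dividing the position by 10 gives the amount to subtract from each row.
adjusted = means - position / 10
print(adjusted.round(1).tolist())  # [2.0, 1.0, 1.9, 1.8]
```

Note that the numbering follows row order only; the Sum column plays no part in it.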


Full code + illustration


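import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["hamm", "jam", "tan", "pan"],
    "Sum": [34, 54, 36, 39],
    "a": [2, 1, 3, 4],
    "b": [2, 1, 1, 4],
})

# Step 1: row-wise mean of columns a and b
df["mean"] = df[["a", "b"]].mean(axis=1)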
print(df)
#    Name  Sum  a  b  mean
# 0  hamm   34  2  2   2.0
# 1   jam   54  1  1   1.0
# 2   tan   36  3  1   2.0
# 3   pan   39  4  4   4.0

# Step 2.1 + 2.2
print(df.groupby("mean").cumcount())
# 0    0
# 1    0
# 2    1
# 3    0
# dtype: int64

# Step 2.3
print(df.groupby("mean").cumcount().divide(10))
# 0    0.0
# 1    0.0
# 2    0.1
# 3    0.0
# dtype: float64

# Step 3
df["mean"] = df["mean"].subtract(df.groupby("mean").cumcount().divide(10))
print(df)
#    Name  Sum  a  b  mean
# 0  hamm   34  2  2   2.0
# 1   jam   54  1  1   1.0
# 2   tan   36  3  1   1.9
# 3   pan   39  4  4   4.0
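
The cumcount above numbers tied rows in their order of appearance, so it matches the question only when the row with the larger Sum happens to come later. The following is a possible variant, not part of the original answer, that orders the rows by Sum before counting. It assumes the rule stated in the comments: within a tie, the smallest Sum stays unchanged and each larger Sum loses a further 0.1. The names `by_sum` and `penalty` are introduced here for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["hamm", "jam", "tan", "pan"],
    "Sum": [34, 54, 36, 39],
    "a": [2, 1, 3, 4],
    "b": [2, 1, 1, 4],
})

df["mean"] = df[["a", "b"]].mean(axis=1)

# Sort by Sum so that, within each group of tied means, the smallest Sum
# comes first (position 0, unchanged) and the largest Sum comes last.
by_sum = df.sort_values("Sum", kind="mergesort")
penalty = by_sum.groupby("mean").cumcount().div(10)

# pandas aligns on the index labels, so each penalty lands on its original row.
df["mean"] = df["mean"] - penalty
print(df)
```

For the sample data this gives the same result as before (row 2 becomes 1.9), but it no longer depends on the order of the rows.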
Alexandre B.
  • This isn't applicable to the problem statement. For example, in this particular problem it doesn't apply to df.mean for rows 0 and 2. – Amit Apr 30 '20 at 08:08