
I want to apply a function to every column of a pandas dataframe, but only in rows where a specific column meets a condition. I have a dataframe with 6 columns and 5 rows. The 6th column is the sum of the first 5 columns, and if that sum is greater than 1 for a row, I want to multiply all the columns in that row by a number (scalar) so that the row's sum is lower than 1. Below is a simplified dataframe (my original dataframe has 20 columns and 4 million rows).

    A       B       C       D      E        Sum
1   0.004   0.04    0.08    0.6    0.013    0.737
2   0.12    0.25    0.08    0.6    0.014    1.064
3   0.05    0.02    0.08    0.3    0.019    0.469
4   0.08    0.003   0.05    0.1    0.011    0.244
5   0.56    0.04    0.08    0.7    0.016    1.396

I want to multiply each column in the 2nd and 5th rows by a number so that the sum of each of those rows is less than 1.

I tried to apply the following function to the dataframe, but apparently this code applies the function to each value in the dataframe, and I also could not figure out how to select the rows whose sum is greater than 1.

def func(value):
    if value > 1:
        return value * 0.71
    else:
        return value
    Does this answer your question? [Pandas conditional creation of a series/dataframe column](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column) – Eric Leung Dec 08 '20 at 02:24

3 Answers


You can use the .where() method. It might be a bit confusing at first, because it keeps values where the condition is True and, in the words of the documentation, "Replace[s] values where the condition is False."

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

df.where(df <= 1, df * 0.71)
#       A     B
# 0  0.00  1.00
# 1  1.42  2.13
# 2  2.84  3.55
# 3  4.26  4.97
# 4  5.68  6.39

The second argument (df * 0.71 above) supplies the replacement values that are used wherever the condition (df <= 1 above) is False.

So wherever a value is not less than or equal to 1 (that is, greater than 1, as in your question), it is replaced by the multiplied value; all other values are left unchanged.
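The example above tests every cell on its own, whereas the question puts the condition on the Sum column and wants whole rows scaled. One possible adaptation, sketched here under assumptions, is to build a per-row factor with Series.where and multiply along the rows. The fixed factor 0.71 is borrowed from the question's attempt, and the names needs_scaling, factor and scaled are made up for illustration.

```python
import pandas as pd

# Three rows from the question's sample data; Sum is recomputed here.
df = pd.DataFrame(
    {
        "A": [0.004, 0.12, 0.56],
        "B": [0.04, 0.25, 0.04],
        "C": [0.08, 0.08, 0.08],
        "D": [0.6, 0.6, 0.7],
        "E": [0.013, 0.014, 0.016],
    }
)
df["Sum"] = df.sum(axis=1)

# One boolean per row: True where the row total is too large.
needs_scaling = df["Sum"] > 1

# Per-row factor: 0.71 for rows that need scaling, 1.0 for the rest.
# 0.71 is an assumed constant taken from the question, not a general rule.
factor = pd.Series(0.71, index=df.index).where(needs_scaling, 1.0)

# axis=0 lines the factor up with the row index, so every column
# (including Sum) in a row is multiplied by that row's factor.
scaled = df.mul(factor, axis=0)
```

With this data, 0.71 happens to be small enough for the largest row (1.396 * 0.71 is just under 1); a different dataset may need a different factor.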

Eric Leung
  • Thank you very much for your answer Eric. Actually, I am calculating the scalar by dividing the desired sum (which is the maximum value of another dataset) by the sum of the corresponding row in this dataframe. And then I am multiplying each column value by this scalar and in the end it gives the desired sum. – MEHMET ALTINPINAR Dec 08 '20 at 06:33
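
The calculation described in this comment (scalar = desired sum / row sum, applied only to rows whose sum exceeds 1) might look roughly like the sketch below. The value of target is a hypothetical stand-in for "the maximum value of another dataset", which is not given in the thread, and value_cols, row_sum and factor are names chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [0.004, 0.12, 0.05, 0.08, 0.56],
        "B": [0.04, 0.25, 0.02, 0.003, 0.04],
        "C": [0.08, 0.08, 0.08, 0.05, 0.08],
        "D": [0.6, 0.6, 0.3, 0.1, 0.7],
        "E": [0.013, 0.014, 0.019, 0.011, 0.016],
    }
)
value_cols = ["A", "B", "C", "D", "E"]

# Hypothetical desired row sum; the real value comes from another dataset.
target = 0.95

row_sum = df[value_cols].sum(axis=1)

# Per-row scalar: target / row_sum where the row is too large, 1.0 otherwise.
factor = (target / row_sum).where(row_sum > 1, 1.0)

# Scale each row by its own scalar, then recompute the Sum column.
df[value_cols] = df[value_cols].mul(factor, axis=0)
df["Sum"] = df[value_cols].sum(axis=1)
```

Everything here is vectorized, so it avoids a Python-level loop over the rows, which matters at 4 million rows.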

If you don't care what the scalar is, so long as the totals are less than 1 after applying it, this will work.

It splits the df into two parts, one with Sum > 1 and the other with Sum <= 1. To ensure that every row in the scaled part sums to less than 1, find the max of the Sum column in that part and multiply all of its values by .99/max. Then concatenate the two dataframes to get your final df.

It should be faster than using apply or transform or anything else that is iterative. Hopefully you can work with this; otherwise, back to the drawing board.

import io
import pandas as pd

data = '''A B C D E Sum
0.004 0.04 0.08 0.6 0.013 0.737
0.12 0.25 0.08 0.6 0.014 1.064
0.05 0.02 0.08 0.3 0.019 0.469
0.08 0.003 0.05 0.1 0.011 0.244
0.56 0.04 0.08 0.7 0.016 1.396'''

df = pd.read_csv(io.StringIO(data), sep=' ')
# Split into the rows that need scaling and the rows that do not
df1 = df.loc[df['Sum'] > 1]
df2 = df.loc[df['Sum'] <= 1]
# Largest row sum among the rows to be scaled
max_divisor = df1['Sum'].max()

# One common factor for all of df1; the Sum column is scaled along with A..E
df1 = df1.mul(.99/max_divisor)
df_scaled = pd.concat([df1, df2], axis=0)

Output:

In [46]: df_scaled
Out[46]:
          A         B         C         D         E       Sum
1  0.085100  0.177292  0.056734  0.425501  0.009928  0.754556
4  0.397135  0.028367  0.056734  0.496418  0.011347  0.990000
0  0.004000  0.040000  0.080000  0.600000  0.013000  0.737000
2  0.050000  0.020000  0.080000  0.300000  0.019000  0.469000
3  0.080000  0.003000  0.050000  0.100000  0.011000  0.244000
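
As the output shows, concatenating the two parts puts the scaled rows first. If the original row order matters, one option (a sketch on a small made-up frame, not part of the answer's code) is to sort by the index after the concat, since both parts keep their original index labels. The names over and rest are chosen only for this illustration.

```python
import pandas as pd

# Small made-up frame: only the middle row sums to more than 1.
df = pd.DataFrame({"A": [0.3, 0.9, 0.2], "B": [0.4, 0.5, 0.1]})
df["Sum"] = df["A"] + df["B"]

over = df.loc[df["Sum"] > 1]
rest = df.loc[df["Sum"] <= 1]

# Same idea as above: one common factor based on the largest row sum.
over = over.mul(0.99 / over["Sum"].max())

# sort_index() restores the original row order after concatenation.
df_scaled = pd.concat([over, rest]).sort_index()
```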
Jonathan Leon
  • Thank you very much for your answer Jonathan. As I stated in my comment on Eric's answer above, I am calculating the scalar by dividing the desired sum (which is the maximum value of another dataset) by the sum of the corresponding row in this dataframe. And then I am multiplying each column value by this scalar and in the end it gives the desired sum. – MEHMET ALTINPINAR Dec 08 '20 at 06:34

To do this, we isolate the rows with a sum larger than 1 and divide each of them by a number slightly larger than its own sum (its sum times 1.1), so every scaled row ends up summing to 1/1.1 ≈ 0.909. Then we recompute the sum.

import pandas as pd

df = pd.DataFrame.from_records([
    {'A': 0.004, 'B': 0.04, 'C': 0.08, 'D': 0.6, 'E': 0.013, 'Sum': 0.737},
    {'A': 0.12, 'B': 0.25, 'C': 0.08, 'D': 0.6, 'E': 0.014, 'Sum': 1.064},
    {'A': 0.05, 'B': 0.02, 'C': 0.08, 'D': 0.3, 'E': 0.019, 'Sum': 0.469},
    {'A': 0.08, 'B': 0.003, 'C': 0.05, 'D': 0.1, 'E': 0.011, 'Sum': 0.244},
    {'A': 0.56, 'B': 0.04, 'C': 0.08, 'D': 0.7, 'E': 0.016, 'Sum': 1.396},
])

big_sum = df['Sum'] > 1
df.loc[big_sum, 'A':'E'] = df.loc[big_sum, 'A':'E'].div((df.loc[big_sum, 'Sum'] * 1.1), axis=0)
df.loc[big_sum, 'Sum'] = df.loc[big_sum, 'A':'E'].sum(axis=1)

>>> df
          A         B         C         D         E       Sum
0  0.004000  0.040000  0.080000  0.600000  0.013000  0.737000
1  0.102529  0.213602  0.068353  0.512645  0.011962  0.909091
2  0.050000  0.020000  0.080000  0.300000  0.019000  0.469000
3  0.080000  0.003000  0.050000  0.100000  0.011000  0.244000
4  0.364678  0.026048  0.052097  0.455848  0.010419  0.909091
BallpointBen