How can I Sum specific columns in a dataframe based on index value provided in a different df. Return the sum and Boolean T/F

Question

I have two pandas dataframes. df_lst contains a list of column names and expected value, and df has a series of data.

The column names in df_lst may change and I use the following script to look up the column index from df that aligns to the column name in df_lst Showing this code incase it is an extra step that might not be needed.

ind_dict = dict((k,i) for i,k in enumerate(d.columns))
inter = set(df_lst['Col_Name']).intersection(df)
df_lst['Index'] = [ ind_dict[x] for x in inter ]

The input for this task would look like this:

import random
import numpy as np
import pandas as pd

a = np.random.randint(12, size=(7, 11))
df = pd.DataFrame(a, ['foo','foo','bar', 'bar', 'bar', 'foo', 'foo'], ['a','b','f','g','h','j' ,'k', 'r', 's', 't', 'z'])

df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'], 
                   'Expected Value': [100, 90, 122, 111],                                      
                   'Index': [4, 6, 8, 9]})

How can I use the new Index values to look at that corresponding column in df and sum the values and return both the summed value and a 'True' if greater than or 'False' if less than for each row in df_lst

df_out = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'], 
                   'Expected Value': [100, 90, 122, 111],                                      
                   'Index': [4, 6, 8, 9],
                   'Sum of Col': ['sum of col_g', 'sum of col_j', 'sum of col_r', 'sum of col_s'],
                   'Bool': ['True or False', 'True or False', 'True or False', 'True or False']
                   })

Eventually this True/False data will be part of a while loop that checks something like "while 1 or more is false do X"

Don't describe your data in words. Instead, provide all the info needed to understand clearly your problem and that allow us to reproduce it. This means making an example showing a sample of `df`, `df_lst` and the expected output (explained). See here how to make a [good pandas reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Help us to help you. — Rodalm, Nov 03 '21 at 17:16

Henry Ecker · Accepted Answer · 2021-11-03T18:01:50.503

We can select the values from df using the values from df_lst['Index'] and iloc we will need to subtract 1 to convert from 1 based indexing to 0 based. Then sum the column and join back to the DataFrame. We can then compute the Bool column based on the new Sum of Col values:

df_out = df_lst.join(
    df.iloc[:, df_lst['Index'] - 1].sum()
        .add_prefix('Col_')
        .rename('Sum of Col'),
    on='Col_Name'
)

df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']

df_out:

  Col_Name  Expected Value  Index  Sum of Col   Bool
0    Col_g             100      4         106   True
1    Col_j              90      6          97   True
2    Col_r             122      8          95  False
3    Col_s             111      9         113   True

Steps:

Select with iloc notice that integer indexes start at 0 so the g column is at index 3 not 4:

df.iloc[:, df_lst['Index'] - 1]

      g   j   r   s
foo   0   7  14  16
foo  23  13  12  12
bar   5  13   3  16
bar  17  13  24  16
bar  24  14  11  23
foo  17  19  24  17
foo  20  18   7  13

Sum Columns with sum:

df.iloc[:, df_lst['Index'] - 1].sum()
Out[3]: 
g    106
j     97
r     95
s    113
dtype: int64

add_prefix so columns match Col_Name column and rename the Series so that the new column has the correct name:

df.iloc[:, df_lst['Index'] - 1].sum().add_prefix('Col_').rename('Sum of Col')

Col_g    106
Col_j     97
Col_r     95
Col_s    113
Name: Sum of Col, dtype: int64

join together with df_lst:

df_lst.join(
    df.iloc[:, df_lst['Index'] - 1].sum()
        .add_prefix('Col_')
        .rename('Sum of Col'),
    on='Col_Name'
)

  Col_Name  Expected Value  Index  Sum of Col
0    Col_g             100      4         106
1    Col_j              90      6          97
2    Col_r             122      8          95
3    Col_s             111      9         113

Do whatever comparisons are needed and add any additional columns:

df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']

  Col_Name  Expected Value  Index  Sum of Col   Bool
0    Col_g             100      4         106   True
1    Col_j              90      6          97   True
2    Col_r             122      8          95  False
3    Col_s             111      9         113   True

Reproducible Setup:

import pandas as pd
from numpy.random import Generator, MT19937

rng = Generator(MT19937(25))
a = rng.integers(25, size=(7, 11))
df = pd.DataFrame(a, ['foo', 'foo', 'bar', 'bar', 'bar', 'foo', 'foo'],
                  ['a', 'b', 'f', 'g', 'h', 'j', 'k', 'r', 's', 't', 'z'])

df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
                       'Expected Value': [100, 90, 122, 111],
                       'Index': [4, 6, 8, 9]})

Thank you for breaking down each step. That is really helpful as I learn more! Ill work in building this into my script and then move onto trying to wrap it into a while loop. — jx24, Nov 03 '21 at 18:23

How can I Sum specific columns in a dataframe based on index value provided in a different df. Return the sum and Boolean T/F

1 Answers1