Calculating a new column in pandas dataframe which is a subset of values returns column not found error

Question

I'm interested in finding the sum of values in a column creating a new column in the process on a subset of a dataframe meeting some condition. I'm not sure of how to work the sum of a new column from these two as I get an error when I try to access the New column created in the process:

import pandas as pd 

d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3]
    }
df=pd.DataFrame(d1)

for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno #Selects the rows matching RUNno
    df[df1]['NewColumn']=df[df1]['X']+df[df1]['Y'] #For the selected dataset, calculates the sum of two columns and creates a new column
    print(df[df1].NewColumn) #Print the contents of the new column

I am unable to get df[df1].NewColumn contents as it is unable to identify the Key NewColumn. I'm pretty sure this way of creating new columns works on the standard dataframe df but not sure why it doesn't work on df[df1]. For eg.

df['NewColumn']=df['X']+df['Y'] 
df.NewColumn

Would work seamlessly.

To update the question, the columns data entries that are added to form the new column are from two different dataframes.

import pandas as pd 
from scipy.interpolate import interp1d 
interpolating_functions=dict() 
d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000], 
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3], 
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] } 
d2={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000], 
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3], 
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] } 
df=pd.DataFrame(d1) 
df2=pd.DataFrame(d2)
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno 
    df3=df.RUN==RUNno 
    interpolating_functions[RUNno]=interp1d(df2[df3].X,df2[df3].Y) 
    df[df1]['NewColumn']=df[df1]['X']+interpolating_functions[RUNno](df2[df3]['X']) 
    print(df[df1].NewColumn)

jezrael · Accepted Answer · 2019-02-14T14:07:14.183

1

Use custom function with GroupBy.apply with create new column and then return each group - here x:

def func(x):
    #check groups
    print (x)
    #working with groups DataFrame x
    x['NewColumn']=x['X']+x['Y']
    return x

df = df.groupby('RUN').apply(func)

print (df)
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300

It seems you need loc for select columns by masks, only necessary same length of index in both DataFrames:

for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno 
    df3=df.RUN==RUNno 
    interpolating_functions[RUNno]=interp1d(df2.loc[df3, 'X'], df2.loc[df3,'Y']) 

    df.loc[df1, 'NewColumn'] = df.loc[df1, 'X'] + interpolating_functions[RUNno](df2.loc[df3, 'X']) 

print (df)
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300

edited Feb 14 '19 at 14:07

answered Feb 14 '19 at 11:43

jezrael

822,522
95
1,334
1,252

Thanks @jezrael. I am using the df1=df.RUN==RUNno as eventually I have to add the column X values to an array of linear Interpolation values based on RUN0 like interp1d[RUNno](df[df1]['X'])+ df[df1]['X']. So groupby can't be used. The linear interpolation will be from another dataframe but uses some values from this dataframe. I tried to keep the problem statement simple. Is there any reason why df[df1]['Newcolumn'] doesn't work to create a new column ? – SSM Feb 14 '19 at 12:15
@SSM - Not understand. Why is not possible use it in `custom function` ? `x` in each groups is like `df1` in your solution. – jezrael Feb 14 '19 at 12:18
I need the RUNno value to select from an array of lists to run the interpolation on. I have a Interpolation[RUNno] which is a scipy.interpolate.interplate.interp1d array . So first I select a subset from df dataframe based on RUNno and then use the same RUNno to select which list of values to select from Interpolatin[RUNno] array. Had the two sums been selected from the same dataframe, it would be straightforward to use the groupby('RUN') but Interpolation is from 'df2' dataframe and X column is from 'df' dataframe which complicates for me. – SSM Feb 14 '19 at 12:26
@SSM - Is possible add this second array? I think it is OK if question should be a bit complictaed, but better for real data. – jezrael Feb 14 '19 at 12:28
@SSM - Not understand, can you change sample? – jezrael Feb 14 '19 at 13:32
import pandas as pd from scipy.interpolate import interp1d interpolating_functions=dict() d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000], 'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3], 'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] } d2={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000], 'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3], 'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] } df=pd.DataFrame(d1) df2=pd.DataFrame(d2) – SSM Feb 14 '19 at 13:48
for RUNno in (df.RUN.unique()): df1=df.RUN==RUNno df3=df.RUN==RUNno interpolating_functions[RUNno]=interp1d(df2[df3].X,df2[df3].Y) df[df1]['NewColumn']=df[df1]['X']+interpolating_functions[RUNno](df2[df3]['X']) print(df[df1].NewColumn) – SSM Feb 14 '19 at 13:48
I have attached the example in two comments above as there is a character limitation per comment. Please read them in concatenated sections. – SSM Feb 14 '19 at 13:50
@SSM - Can you add it to question by [edit](https://stackoverflow.com/posts/54689566/edit) ? – jezrael Feb 14 '19 at 13:53
@SSM - Can you check now? – jezrael Feb 14 '19 at 14:07
Thanks @jezrael, it works now. Could you kindly help explain the rationale for the changes to .loc and why the original process didn't work. Will definitely help to learn from the experts ! – SSM Feb 15 '19 at 17:17
@SSM - If dont use `.loc` is possible chained indexing and assign back not working, check [this](https://stackoverflow.com/a/53954986/2901002) for more info. – jezrael Feb 16 '19 at 06:18

Calculating a new column in pandas dataframe which is a subset of values returns column not found error

1 Answers1