2

I am trying to create a function to decrease the amount of my coding. The function uses exec() on a Pandas data frame to do some calculations. But when I put the code inside exec() it returns NaNs. Here is the function:

def FeatureMaker(data, criteria_col, f_data):
    source = data.copy()
    for key, value in f_data.items():
        for i in source[criteria_col].unique():
            source.set_value(source[source[criteria_col] == i].index.tolist(), key, exec('source[source[criteria_col] == i]["' + value[0] + '"].' + value[1] + '()'))
    return source

The function is supposed to get a data frame, find the unique items of a column which has been passed as the column holding the criteria (in this case the 'id' column), and calculate what has been passed to it. In this function, argument data holds the data frame, criteria_col holds the name of a column which has the values of the criteria, and f_data is a dictionary whose keys are the names of the new columns and its values the column on which the calculation has to be performed and the calculation itself.This is an example of how I run the function:

FeatureMaker(spend_alpha, 'id', {'total_sum': ["spend", "sum"]})

In this example, spend_alpha is the name of my data frame, id is my criteria column, and I want to create the total_sum column by calculating the sum of the spend column.

The function runs perfectly. However, it returns nothing but NaNs. I tried the code below:
for i in spend_alpha['id'].unique():
    spend_alpha.set_value(spend_alpha[spend_alpha['id'] == i].index.tolist(), 'total_sum', spend_alpha[spend_alpha['id'] == i]['spend'].sum())

And it works without a problem. I get the total sum of each id. However, this code returns NaNs as well:

for i in spend_alpha['id'].unique():
    spend_alpha.set_value(spend_alpha[spend_alpha['id'] == i].index.tolist(), 'total_sum', exec("spend_alpha[spend_alpha['id'] == i]['spend'].sum()"))

My question is, how can I use exec() on a Pandas data frame without getting NaNs? Thank you in advance.

Edit:

Reproducible example: Lets say I create the spend_alpha data frame as follows...
In [1]: import pandas as pd

In [2]: spend_alpha = pd.DataFrame([[100, 250],
   ...:                             [101, 50],
   ...:                             [102, 60],
   ...:                             [100, 50],
   ...:                             [102, 30],
   ...:                             [101, 50]], columns=['id', 'spend'])

In [3]: spend_alpha
Out[3]:
    id  spend
0  100    250
1  101     50
2  102     60
3  100     50
4  102     30
5  101     50

I can add a new column called total_sum by running the code below...

In [4]: for i in spend_alpha['id'].unique():
   ...:     spend_alpha.set_value(spend_alpha[spend_alpha['id'] == i].index.tolist(), 'total_sum', spend_alpha[spend_alpha['id'] == i]['spend'].sum())
   ...:

In [5]: spend_alpha
Out[5]:
    id  spend  total_sum
0  100    250      300.0
1  101     50      100.0
2  102     60       90.0
3  100     50      300.0
4  102     30       90.0
5  101     50      100.0

However, if I place the third argument of the set_value() function inside exec() it well return NaNs. As shown below:

In [6]: for i in spend_alpha['id'].unique():
   ...:     spend_alpha.set_value(spend_alpha[spend_alpha['id'] == i].index.tolist(), 'total_sum', exec("spend_alpha[spend_alpha['id'] == i]['spend'].sum()"))
   ...:

In [7]: spend_alpha
Out[7]:
    id  spend  total_sum
0  100    250        NaN
1  101     50        NaN
2  102     60        NaN
3  100     50        NaN
4  102     30        NaN
5  101     50        NaN

I want to be able to place the third argument inside exec() so that I can use the line within a function. This well let me calculate other aggregations other than sum() by passing it to the function. But it is not possible since exec() returns NaNs.

Likewise, the function has to calculate the sum of column spend per unique items of column id from the spend_alpha data frame and save it in a column called total_sum. But it returns NaNs.
In [8]: def FeatureMaker(data, criteria_col, f_data):
   ...:     source = data.copy()
   ...:     for key, value in f_data.items():
   ...:         for i in source[criteria_col].unique():
   ...:             source.set_value(source[source[criteria_col] == i].index.tolist(), key, exec('source[source[criteria_col] == i]["' + value[0] + '"].' + value[1] + '()'))
   ...:     return source

In [9]: FeatureMaker(spend_alpha, 'id', {'total_sum': ["spend", "sum"]})
Out[9]:
    id  spend  total_sum
0  100    250        NaN
1  101     50        NaN
2  102     60        NaN
3  100     50        NaN
4  102     30        NaN
5  101     50        NaN

I would be really thankful if anyone can help me overcome this problem.

m.m
  • 125
  • 7
  • Can you post small reproducible sample input data set(s) and your desired data set in your question? Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. – MaxU - stand with Ukraine Sep 21 '17 at 17:50
  • Thanks for the comment. Added a reproducible example. It should be easy to repeat everything I have done by copy/pasting the input codes to Jupyter or Spyder. I am looking forward to your answer. – m.m Sep 23 '17 at 03:34
  • It’s better now, but it’s still not reproducible. In the “FeatureMaker” function you are using/passing two data sets. Please post samples for both of them and your desired data set. I don’t think you need loops at all as it can be done in a vectorized way. But I need to see a second and a resulting data sets in order to be able to help. – MaxU - stand with Ukraine Sep 23 '17 at 06:16
  • Thanks again. I have updated the reproducible example. However, there is a misunderstanding here. Let me try to explain. The "FeatureMaker" function takes only one data frame. That is the argument named "data". The data frame passed to this argument is "spend_alpha". The second argument is a column name which holds the criteria by which the calculation which is to be performed on the data frame has to be separated. For example, the calculation has to be performed on each person so it has to be separated by 'id'. This argument is 'criteria_col' and the column name passed to it is 'id'... – m.m Sep 25 '17 at 12:42
  • ... The third argument is 'f_data'. This argument is a dictionary. The keys of this dictionary are the names of the new columns which is supposed to be created, in this case "total_sum". The value of each key is a list. This list has two elements, first is the name of the column which the calculation is based on, in this case "spend". And the second element is the calculation which has to be done. For example "sum" means calculate the sum. I hope I was able to clarify things. Thanks for following up on my question. – m.m Sep 25 '17 at 12:46

0 Answers0