-1

I have a dataframe with 3 columns a, b, c like below:

import pandas as pd
df = pd.DataFrame({'a': [1, 1, 5, 3], 'b': [2, 0, 6, 1], 'c': [4, 3, 1, 4]})

I want to add a column d that is the sum of some columns in df, but not the same columns for every row. For example:

[image: expected dataframe with column d]

Only rows 1 and 3 are summed from the same columns; rows 0 and 2 are summed from other columns.

What I found on Stack Overflow always sums fixed columns across the whole dataframe, but this case is different.

What is the best way to do this?

actnmk

2 Answers

0

Because column d follows no fixed rule, the only way to compute it is separately for each row.

df['d'] = 0
# Use df.loc[row, col] rather than chained df['d'].iloc[...] so the write hits df itself
df.loc[0, 'd'] = df.loc[0, 'b']
df.loc[1, 'd'] = df.loc[1, 'a'] + df.loc[1, 'c']
df.loc[2, 'd'] = df.loc[2, 'a']
df.loc[3, 'd'] = df.loc[3, 'a'] + df.loc[3, 'c']

If rows 1 and 3 share a rule (here, the odd-indexed rows are the sum of a and c), assign them together:

odd = (df.index % 2) == 1
df.loc[odd, 'd'] = df.loc[odd, 'a'] + df.loc[odd, 'c']

Alternatively, with a for loop:

for i in range(len(df)):
    if i % 2 == 1:
        df.loc[i, 'd'] = df.loc[i, 'a'] + df.loc[i, 'c']
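
The per-row rules above can also be written as one vectorized assignment. This is a minimal sketch, assuming NumPy is available and using np.select; the conditions simply mirror the four hardcoded rows shown earlier and are illustrative, not a general rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 5, 3], 'b': [2, 0, 6, 1], 'c': [4, 3, 1, 4]})

# Odd-indexed rows share one rule (a + c); rows 0 and 2 each get their own.
odd = (df.index % 2) == 1
conditions = [df.index == 0, odd, df.index == 2]
choices = [df['b'], df['a'] + df['c'], df['a']]

# np.select takes, for each row, the choice whose condition is the first to be True.
df['d'] = np.select(conditions, choices, default=0)
print(df)
```

This avoids both the explicit loop and chained indexing, at the cost of having to spell out one condition per rule.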
LoukasPap
  • Sorry, I made a mistake in the expected df: column d of rows 1 and 3 follows the same rule, the sum of columns a and c. How can I do it for both rows at the same time? – actnmk Jan 06 '21 at 17:49
  • Please tell me if it works, so I can change it if not. @actnmk – LoukasPap Jan 06 '21 at 17:55
  • it works! thank you so much – actnmk Jan 06 '21 at 18:01
  • L. Papadopoulos, this is a blatant duplicate of [Dynamically evaluate expression from formula in pandas?](https://stackoverflow.com/questions/53779986/dynamically-evaluate-expression-from-formula-in-pandas). People will generally vote to close as dupe, and you're not going to earn rep from an answer on a closed dupe. – smci Jan 06 '21 at 18:10
  • @smci I just saw the answer. It is so long that the asker may get confused. It is better that I wrote just what he needs. – LoukasPap Jan 06 '21 at 18:14
  • 1
    L. Papadopoulos, I didn't say you plagiarized this. I did say that the **question** was a blatant dupe, which I already flagged 30 min ago, and when the question is known to be a dupe, [the proper SO behavior is to vote to close as a dupe in favor of the target question](https://stackoverflow.com/help/flagging), not to answer it. Additionally, you could adapt this answer and post it there (answer would need to restate the example formula you're trying to solve, obviously). – smci Jan 06 '21 at 18:16
  • ...yes, [that target answer](https://stackoverflow.com/a/53779987/202229) is way too long and needs structure and concise, specific examples. I'm separately editing it; you could also suggest improvements there via comments. But that's not an excuse to knowingly not close this as a dupe. – smci Jan 06 '21 at 18:17
  • @smci Thanks for your comment. I have read the answer; it's too long and, for a beginner like me, not easy to understand. – actnmk Jan 06 '21 at 18:33
  • 1
    @smci I agree that duplicates must be closed. But actnmk is a beginner, as he says, and my answer is beginner-friendly because it is short and to the point, without confusing explanations. The other answer you mention does not need to get shorter; it is like a wiki answer that analyses a problem and gives multiple solutions. – LoukasPap Jan 06 '21 at 18:49
  • I posted a one-line solution that uses `pd.eval()` to dynamically evaluate a different `df['formula']` on each line, no manual hardcoding required. (Eventually that other long answer needs to be slimmed down and updated with this sort of thing) – smci Jan 06 '21 at 20:25
-2

The dynamic way uses pd.eval(), as per [this solution][1]. It assumes the dataframe has a string column df['formula'] holding each row's expression (e.g. 'a+c'). pd.eval() evaluates each row's formula individually, so df['formula'] can differ from row to row and nothing is hardcoded in your code. There's a lot going on in this one-liner; see the explanation in NOTES below.

df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()), axis=1)

0    2
1    4
2    5
3    4
#    ^--- this is the result

and if you want to assign that result to a dataframe column, say df['z']:

  • df['z'] = df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()), axis=1)
  • Alternatively you could use pd.eval(..., inplace=True), but then the formula would need to contain an actual assignment, e.g. 'z=a+b', and the 'z' column would need to have been declared already: df['z'] = np.nan. That part is slightly annoying to implement, so I didn't.

NOTES:

  1. we use pd.eval(...) to dynamically evaluate the ['formula'] column
    • ...using the pd.eval(.., local_dict=...) argument to pass in the variables for that row
  2. to evaluate an expression on each dataframe row, we use df.apply(..., axis=1). We have to provide some lambda function to tell it what to evaluate.
  3. So how does pd.eval() know how to map the strings a,b,c to their values on that individual row?
    • When we call df.apply(..., axis=1) row-wise like that, each row gets passed in as an individual Series, so within our apply(... axis=1), we can no longer reference the dataframe as df or its columns as df['a'], df['b'], ...
    • So instead we need to pass in that row as a Python dict, hence the local_dict=row.to_dict() argument to pd.eval, inside the lambda function.
  4. The pd.eval() approach can handle arbitrarily complicated formulas in the variables, not just simple sums; it can handle e.g. (a + c**2)/(b+c). You could reference external constants, or external functions e.g. log10.
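
The one-liner and the notes above assume that df already has a formula column. Here is a minimal, self-contained sketch of the whole flow. The formula strings are made up for illustration (the original does not show them), and engine='python' is passed only so the example does not depend on numexpr being installed:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 5, 3], 'b': [2, 0, 6, 1], 'c': [4, 3, 1, 4]})

# Hypothetical per-row formulas, assumed for illustration only.
df['formula'] = ['a+b', 'a+c', 'b+c', 'a+c']

def eval_row(row):
    # row is a Series; to_dict() exposes its values as the names a, b, c
    # (plus the unused 'formula' string) to the expression.
    return pd.eval(row['formula'], local_dict=row.to_dict(), engine='python')

# axis=1 calls eval_row once per row.
df['z'] = df.apply(eval_row, axis=1)
print(df)
```

Because each row triggers a separate parse and evaluation, this is convenient rather than fast; when many rows share the same formula, grouping by formula and evaluating once per group would likely scale better.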

References: [1]: Compute dataframe columns from a string formula in variables?

smci