Is there a more concise way to conditionally loop over rows in a dataframe?

Question

I have a simple dataframe and would like to apply a function to a particular column based on the status of another column.

myDF = pd.DataFrame({'trial': ['A','B','A','B','A','B'], 'score': [1,2,3,4,5,6]})

I would like to multiply each observation in the score column by 10, but only if the trial name is 'A'. If it is 'B', I would like the score to remain the same.

I have a working version that does what I need, but I'm wondering if there is a way I can do this without having to create the new dataframe. Not that it makes a huge difference, but eight lines of code seems pretty long and I'm guessing there might be a simpler solution I have overlooked.

newDF = pd.DataFrame(columns = ['trial','score'])
for row in myDF.iterrows():    
    if row[1][0] == 'A':
        newScore = {'trial': 'A', 'score': row[1][1]*10}
        newDF = newDF.append(newScore, ignore_index=True)
    else:
        newScore = {'trial': 'B', 'score': row[1][1]}
        newDF = newDF.append(newScore, ignore_index=True)

Kindly post your expected output – sammywemmy Dec 04 '20 at 12:21 — sammywemmy, Dec 04 '20 at 12:21

score 3 · Answer 1 · answered Dec 04 '20 at 12:21

3

you can use loc to select the rows and columns to multiply by 10.

myDF.loc[myDF['trial'].eq('A'), 'score'] *= 10

print(myDF)
  trial  score
0     A     10
1     B      2
2     A     30
3     B      4
4     A     50
5     B      6

and it will be much faster than a looping.

answered Dec 04 '20 at 12:21

Ben.T

29,160
6
32
54

I'm looking for the dup, I have the feeling it exists, but if not it sounds like a common question. was I wrong to use a wiki? – Ben.T Dec 04 '20 at 12:26

score 1 · Accepted Answer · answered Dec 04 '20 at 12:27

1

Pandas is pretty fast, and does the job well! But, functions like .iterrows() is dead slow compared to other methods. There are various articles on this topic, one such is linked here.

Now, you can simply use .apply() function. Which will work wonders - and you can even fit any custom functions!

Here is an example of your work:

myDF["score"] = myDF.apply(lambda x : x[1] * 10 if x[0] == "A" else x[1], axis = 1)

You can even apply any functions using .apply and lambda as below:

def updateScore(trial, score):
    return score if trial != 'A' else score * 10

myDF["score"] = myDF.apply(lambda x : updateScore(trial = x[0], score = x[1]), axis = 1)

For more details, you can check the documentation.

answered Dec 04 '20 at 12:27

Mr. Hobo

530
1
7
22

In my opinion apply here is bad choice. – jezrael Dec 04 '20 at 12:36
https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code – jezrael Dec 04 '20 at 12:36
Yes, there are many other methods. Like the one given by @BenT below is a very nice solution. I just wanted to show him a general approach to solving the problem. – Mr. Hobo Dec 04 '20 at 12:44
ya, depends what need OP, but mainly problem is `apply` are loops, so necessary avoid like here if need performance. – jezrael Dec 04 '20 at 12:51
For your interest @jezrael, I added under my own solution the performance of the different solutions. – Ruthger Righart Dec 04 '20 at 13:32
@Mr. Hobo, please see the performance of the different solutions. – Ruthger Righart Dec 04 '20 at 13:33

Ruthger Righart · Answer 3 · 2020-12-04T13:28:15.020

An alternative way to do this is using replace.

myDF = pd.DataFrame({'trial': ['A','B','A','B','A','B'], 'score': [1,2,3,4,5,6]})
myDF['score'] = myDF['score'].mul(myDF['trial'].replace({'A':10, 'B':1}))

Performance

Below I tested the performance of the different solutions using a dataset of 100000 rows.

N=100000
myDF = pd.DataFrame({'trial': np.random.choice(['A', 'B'], N), 'score': np.random.choice(np.arange(0,10), N)})

Given solutions:

%timeit myDF['score'] = myDF['score'].mul(myDF['trial'].replace({'A':10, 'B':1}))

22.5 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit myDF.loc[myDF['trial'].eq('A'), 'score'] *= 10

6.4 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit myDF["score"] = myDF.apply(lambda x : x[1] * 10 if x[0] == "A" else x[1], axis = 1)

587 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def updateScore(trial, score):
    return score if trial != 'A' else score * 10

%timeit myDF["score"] = myDF.apply(lambda x : updateScore(trial = x[0], score = x[1]), axis = 1)

603 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there a more concise way to conditionally loop over rows in a dataframe?

3 Answers3

Performance