0

I have a simple dataframe and would like to apply a function to a particular column based on the status of another column.

myDF = pd.DataFrame({'trial': ['A','B','A','B','A','B'], 'score': [1,2,3,4,5,6]})

I would like to multiply each observation in the score column by 10, but only if the trial name is 'A'. If it is 'B', I would like the score to remain the same.

I have a working version that does what I need, but I'm wondering if there is a way I can do this without having to create the new dataframe. Not that it makes a huge difference, but eight lines of code seems pretty long and I'm guessing there might be a simpler solution I have overlooked.

newDF = pd.DataFrame(columns = ['trial','score'])
for row in myDF.iterrows():    
    if row[1][0] == 'A':
        newScore = {'trial': 'A', 'score': row[1][1]*10}
        newDF = newDF.append(newScore, ignore_index=True)
    else:
        newScore = {'trial': 'B', 'score': row[1][1]}
        newDF = newDF.append(newScore, ignore_index=True)
    
cookie1986
  • 865
  • 12
  • 27

3 Answers3

3

you can use loc to select the rows and columns to multiply by 10.

myDF.loc[myDF['trial'].eq('A'), 'score'] *= 10

print(myDF)
  trial  score
0     A     10
1     B      2
2     A     30
3     B      4
4     A     50
5     B      6

and it will be much faster than a looping. ​

Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • I'm looking for the dup, I have the feeling it exists, but if not it sounds like a common question. was I wrong to use a wiki? – Ben.T Dec 04 '20 at 12:26
1

Pandas is pretty fast, and does the job well! But, functions like .iterrows() is dead slow compared to other methods. There are various articles on this topic, one such is linked here.

Now, you can simply use .apply() function. Which will work wonders - and you can even fit any custom functions!

Here is an example of your work:

myDF["score"] = myDF.apply(lambda x : x[1] * 10 if x[0] == "A" else x[1], axis = 1)

You can even apply any functions using .apply and lambda as below:

def updateScore(trial, score):
    return score if trial != 'A' else score * 10

myDF["score"] = myDF.apply(lambda x : updateScore(trial = x[0], score = x[1]), axis = 1)

For more details, you can check the documentation.

Mr. Hobo
  • 530
  • 1
  • 7
  • 22
1

An alternative way to do this is using replace.

myDF = pd.DataFrame({'trial': ['A','B','A','B','A','B'], 'score': [1,2,3,4,5,6]})
myDF['score'] = myDF['score'].mul(myDF['trial'].replace({'A':10, 'B':1}))

Performance

Below I tested the performance of the different solutions using a dataset of 100000 rows.

N=100000
myDF = pd.DataFrame({'trial': np.random.choice(['A', 'B'], N), 'score': np.random.choice(np.arange(0,10), N)})

Given solutions:

%timeit myDF['score'] = myDF['score'].mul(myDF['trial'].replace({'A':10, 'B':1}))

22.5 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit myDF.loc[myDF['trial'].eq('A'), 'score'] *= 10

6.4 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit myDF["score"] = myDF.apply(lambda x : x[1] * 10 if x[0] == "A" else x[1], axis = 1)

587 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def updateScore(trial, score):
    return score if trial != 'A' else score * 10

%timeit myDF["score"] = myDF.apply(lambda x : updateScore(trial = x[0], score = x[1]), axis = 1)

603 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Ruthger Righart
  • 4,799
  • 2
  • 28
  • 33