Pandas Iterrows: Faster Alternatives?

Question

I am trying to speed up one of my functions. I have read that 'vectorization' is the fastest way to run these types of operations in Pandas, but how is this (or anything else faster) achievable with this code:

Dummy data

import numpy as np
import pandas as pd

a = pd.DataFrame({'var1' : [33, 75, 464, 88, 34], 'imp_flag' : [1, 0, 0, 1, 1], 'donor_index' : [3, np.nan, np.nan, 4, 0]})

>>> a
   var1  imp_flag  donor_index
0    33         1          3.0
1    75         0          NaN
2   464         0          NaN
3    88         1          4.0
4    34         1          0.0

The operation in question

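b = pd.DataFrame()  # assumed: b starts out empty (it is never defined in the original snippet)
for index, row in a[a['imp_flag'] == 1].iterrows():
    new_row = a[a.index == row.donor_index]
    # DataFrame.append was removed in pandas 2.0; use pd.concat([b, new_row]) there
    b = b.append(new_row)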

Expected output

>>> b
   var1  imp_flag  donor_index
1    75         0          NaN
1    75         0          NaN
2   464         0          NaN

The output you provide does not match the return of your code. Is this expected? Can you clarify the question? — mozway, Jul 16 '21 at 07:48

Sbunzini · Answer 1 · 2021-07-16T08:00:46.543

There are several things that I do not understand.

How can your DataFrame show NaN values in donor_index if there are no NaN values when you create a?

a = pd.DataFrame({'var1' : [33, 75, 464, 88, 34], 'imp_flag' : [1, 0, 0, 1, 1], 'donor_index' : [3, 1, 2, 4, 0]})

>>> a
   var1  imp_flag  donor_index
0    33         1            3
1    75         0            1
2   464         0            2
3    88         1            4
4    34         1            0

Are you sure that the selection a[a['imp_flag'] == 1] in your example is correct? It seems that the only way to get that result for b is the opposite selection, a[a['imp_flag'] == 0].

Also, do you really need duplicated rows in the DataFrame b?

My solution is the following:

idxs = a[a.imp_flag == 0].donor_index
b = a.iloc[idxs]
# or as a one-liner: b = a.iloc[a[a.imp_flag == 0].donor_index]

>>> b
   var1  imp_flag  donor_index
1    75         0            1
2   464         0            2
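
One caveat worth illustrating: iloc selects by position, whereas the loop in the question matches on a.index, i.e. by label. The two agree only while a keeps its default 0..n-1 index. The following is a minimal sketch, assuming the integer donor_index data from this answer; the sort_values call is just a hypothetical way to reorder the rows and is not part of the original question.

```python
import pandas as pd

a = pd.DataFrame({'var1': [33, 75, 464, 88, 34],
                  'imp_flag': [1, 0, 0, 1, 1],
                  'donor_index': [3, 1, 2, 4, 0]})

idxs = a.loc[a['imp_flag'] == 0, 'donor_index'].to_numpy()

# Default RangeIndex: position and label coincide, so both lookups agree.
same_on_default = a.iloc[idxs].equals(a.loc[idxs])

# Hypothetical reordering: labels no longer match positions.
shuffled = a.sort_values('var1')
idxs_shuffled = shuffled.loc[shuffled['imp_flag'] == 0, 'donor_index'].to_numpy()
by_position = shuffled.iloc[idxs_shuffled]  # rows sitting at positions 1 and 2
by_label = shuffled.loc[idxs_shuffled]      # rows whose index label is 1 or 2
```

If donor_index is meant to hold index labels, as the a.index comparison in the question suggests, loc is probably the safer choice.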

mozway · Answer 2 · 2021-07-16T07:46:39.260

IIUC, you want to select the donor rows referenced by the rows where imp_flag equals 1.

You can simply use query to match the relevant rows:

b = a.loc[a.query('imp_flag == 1')['donor_index']]

Alternatively, you can do the same with boolean indexing:

b = a.loc[a[a['imp_flag'] == 1]['donor_index']]

output:

   var1  imp_flag  donor_index
3    88         1            4
4    34         1            0
0    33         1            3
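
The output above assumes an integer donor_index column. With the dummy data from the question, the NaN entries turn that column into floats, so a hedged variant is to drop the NaNs and cast back to int before the lookup. This is only a sketch under that assumption; the name donors is introduced here for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'var1': [33, 75, 464, 88, 34],
                  'imp_flag': [1, 0, 0, 1, 1],
                  'donor_index': [3, np.nan, np.nan, 4, 0]})

# Donor labels of the flagged rows; NaN forces float dtype, so cast back to int.
donors = (a.loc[a['imp_flag'] == 1, 'donor_index']
            .dropna()
            .astype(int)
            .to_numpy())

# A single label-based lookup replaces the whole iterrows loop.
b = a.loc[donors]
```

Note that .loc raises a KeyError if any donor label is missing from a.index, so it may be worth validating donor_index first.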

edited Jul 16 '21 at 07:46

answered Jul 16 '21 at 07:40
