What is the (best practice) correct way to iterate over DataFrames?

I am using:

for i in range(working.shape[0]):
    for j in range(1, working.shape[1]):
        working.iloc[i,j] = (100 - working.iloc[i,j])*100

The above is correct but does not line up with other Stack Overflow answers. I was hoping that someone could explain why the above is not optimal and suggest a superior implementation.

I am very much a novice in programming in general and pandas in particular. Apologies for asking a question that has already been addressed on SO: I didn't really understand the existing answers.

Possible duplicate, but this answer is easy to understand for a novice, if less comprehensive.

Tikhon
  • Fantastic, thank you very much! However, my code omits the first column - can I use applymap more selectively? – Tikhon Aug 29 '19 at 23:11
  • See this [answer](https://stackoverflow.com/a/55557758/9274732) for more information about how NOT to iterate over a DataFrame – Ben.T Aug 29 '19 at 23:18

1 Answer

What is the (best practice) correct way to iterate over DataFrames?

There are several ways (for example, iterrows), but in general you should avoid iteration whenever you can. pandas offers several tools for vectorized operations, which will almost always be faster than an iterative solution.

The example you provided can be vectorized in the following way using iloc:

working.iloc[:, 1:] = (100 - working.iloc[:, 1:]) * 100
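As a quick sanity check, the sketch below runs the loop from the question and the vectorized assignment on copies of the same small frame and compares the results. The sample data and the column names (`id`, `a`, `b`) are assumptions made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
working = pd.DataFrame({"id": [1, 2, 3], "a": [10, 20, 30], "b": [40, 50, 60]})

# Loop version from the question, applied to a copy.
looped = working.copy()
for i in range(looped.shape[0]):
    for j in range(1, looped.shape[1]):
        looped.iloc[i, j] = (100 - looped.iloc[i, j]) * 100

# Vectorized version: one expression over every column except the first.
vectorized = working.copy()
vectorized.iloc[:, 1:] = (100 - vectorized.iloc[:, 1:]) * 100

# Raises an AssertionError if the two results differ in any value.
pd.testing.assert_frame_equal(looped, vectorized, check_dtype=False)
```

The `1:` column slice is what leaves the first column untouched, so the same pattern should cover the "omit the first column" requirement raised in the comments.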

Some timings:

from timeit import Timer

import pandas as pd

working = pd.DataFrame({'a': range(50), 'b': range(50)})


def iteration():
    for i in range(working.shape[0]):
        for j in range(1, working.shape[1]):
            working.iloc[i, j] = (100 - working.iloc[i, j]) * 100


def direct():
    # in actual code you will have to assign back to working.iloc[:, 1:]
    (100 - working.iloc[:, 1:]) * 100


print(min(Timer(iteration).repeat(50, 50)))
print(min(Timer(direct).repeat(50, 50)))

Outputs

0.38473859999999993
0.05334049999999735

That is roughly a 7x difference, and that's with only 50 rows.
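To see how the gap behaves on your own machine as the frame grows, a minimal sketch along the following lines may help. The helper names (`loop_version`, `vectorized_version`), the row counts, and the repeat counts are all assumptions chosen for illustration. Each helper works on a fresh copy so that repeated calls do not keep transforming the same data.

```python
from timeit import Timer

import pandas as pd


def loop_version(frame):
    # Element-by-element update, as in the question (hypothetical helper).
    out = frame.copy()
    for i in range(out.shape[0]):
        for j in range(1, out.shape[1]):
            out.iloc[i, j] = (100 - out.iloc[i, j]) * 100
    return out


def vectorized_version(frame):
    # Single vectorized assignment over all columns but the first.
    out = frame.copy()
    out.iloc[:, 1:] = (100 - out.iloc[:, 1:]) * 100
    return out


timings = {}
for n_rows in (50, 200):  # hypothetical sizes; adjust to your data
    frame = pd.DataFrame({"a": range(n_rows), "b": range(n_rows)})
    loop_time = min(Timer(lambda: loop_version(frame)).repeat(3, 5))
    vec_time = min(Timer(lambda: vectorized_version(frame)).repeat(3, 5))
    timings[n_rows] = (loop_time, vec_time)
    print(n_rows, loop_time, vec_time)
```

The absolute numbers depend on hardware and pandas version, so no output is shown here.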

DeepSpace