forloop over 500000 rows python dataframe

Question

I have a for loop that changes the address format of over 500,000 rows. It works, but it takes a long time to run. Is there a way to make it run more efficiently?

for lab, row in df.iterrows():
    df.loc[lab,"Address"] = (row["Address"].title())

With 500K rows using a database would be a far better option. Although title-casing a field won't benefit from indexing. — Panagiotis Kanavos, Dec 19 '21 at 19:45
`df.Address.str.title()` I'm not sure if this is faster. String methods are usually not vectorized in `pandas`. — Michael Szczesny, Dec 19 '21 at 19:46
Does this answer your question? [What is the most efficient way to loop through dataframes with pandas?](https://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas) — Rivers, Dec 19 '21 at 19:47
@Rivers that doesn't answer the question at all. The answer's code is the same code used here — Panagiotis Kanavos, Dec 19 '21 at 19:47
@PanagiotisKanavos The accepted answer says "If you want it faster, use itertuples" — Barmar, Dec 19 '21 at 19:50
Benchmarked `pandas` `str.title()` with 100k rows: it's ~650x faster. Always measure. — Michael Szczesny, Dec 19 '21 at 19:53

Nicolai B. Thomsen · Answer 1 · 2021-12-19T20:14:29.860

You never want to use iterrows() for this. Indeed, you will want to stay away from any kind of custom row-wise iteration when a column-wise operation is available. Try this instead for your specific purpose:

df.assign(Address=lambda d: d["Address"].str.title())

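It returns a new DataFrame with the updated column and leaves the original unchanged, so assign the result back (df = df.assign(...)) to keep the change.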
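As a minimal sketch of that behaviour, the snippet below uses a small made-up frame (the three addresses are hypothetical stand-ins for the real data) to show that assign() leaves the original column alone until the result is bound back to df.

```python
import pandas as pd

# Hypothetical sample data standing in for the real 500,000-row frame.
df = pd.DataFrame({"Address": ["12 main street", "45 OAK AVENUE", "7 elm road"]})

# assign() builds and returns a new DataFrame with the title-cased column.
titled = df.assign(Address=lambda d: d["Address"].str.title())

# The original frame still holds the raw strings at this point.
unchanged = df["Address"].tolist()

# Rebinding the name is what makes the change stick.
df = titled
print(df)
```

If a new object is not wanted, overwriting the column directly with df["Address"] = df["Address"].str.title() should give the same values.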

Here is the speed test as requested. On my data it is about 266x faster (100 ms versus 26.6 s).

%timeit df.assign(Address=lambda d: d["Address"].str.title())
# 100 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit for lab, row in df.iterrows(): df.loc[lab,"Address"] = (row["Address"].title()) 
# 26.6 s ± 2.63 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
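The %timeit lines above need IPython. For a plain script, a rough comparison could look like the sketch below. It assumes synthetic data (5,000 generated addresses, not the original dataset) and hypothetical helper names, and the absolute timings will vary with machine and pandas version.

```python
import time
import pandas as pd

# Hypothetical synthetic data; 5,000 rows keeps the slow loop short.
base = pd.DataFrame({"Address": [f"{i} north main street" for i in range(5000)]})

def title_with_iterrows(frame):
    """Row-by-row version, mirroring the loop from the question."""
    out = frame.copy()
    for lab, row in frame.iterrows():
        out.loc[lab, "Address"] = row["Address"].title()
    return out

def title_with_str(frame):
    """Column-wise version using the .str accessor."""
    return frame.assign(Address=lambda d: d["Address"].str.title())

start = time.perf_counter()
slow = title_with_iterrows(base)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = title_with_str(base)
str_seconds = time.perf_counter() - start

# Both approaches should produce identical values.
assert slow["Address"].tolist() == fast["Address"].tolist()
print(f"iterrows: {loop_seconds:.4f} s, str.title: {str_seconds:.4f} s")
```

A single perf_counter measurement is noisier than %timeit, so treat the printed numbers as an order-of-magnitude check only.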

edited Dec 19 '21 at 20:14

answered Dec 19 '21 at 19:50


Since this creates a new df, might that expense negate the benefit of not using `iterrows()`? – Barmar Dec 19 '21 at 19:52
Not at all. It doesn't create a new df but simply adds the new column and returns the original object. – Nicolai B. Thomsen Dec 19 '21 at 19:53
The documentation says: "Returns a new object with all original columns in addition to new ones" – Barmar Dec 19 '21 at 19:54
I stand corrected. I believe it will still be significantly faster, but I will create an example to test it. – Nicolai B. Thomsen Dec 19 '21 at 19:56
Thanks academy, df.assign worked but I had to run it as df=df.assign etc to update the DF permanently. It ran in a heartbeat – Agatha Dec 19 '21 at 20:05
No problem. If this answered your question, please consider marking it as the correct answer. – Nicolai B. Thomsen Dec 19 '21 at 20:11
