1. Use itertuples()
instead
Pandas DataFrames are really a collection of columns/Series objects (e.g. for x in df
iterates over the column labels), so even if a loop where to be implemented, it's better if the loop over across columns. iterrows()
is anti-pattern to that "native" pandas behavior because it creates a Series for each row, which slows down code so much. A better/faster option is to use itertuples()
. It creates namedtuples of each row that you can access either by index or column label. There's almost no modification to the code in the OP to apply it.
Also (as @Alireza Mazochi mentioned), to assign a value to a single cell, at
is faster than loc
.
for row in rche_df.itertuples():
# ^^^^^^^^^^ <------ `itertuples` instead of `iterrows`
if isinstance(row.wgs1984_latitude, float):
target = row.address_chi
dict_temp = geocoding(target)
rche_df.at[row.Index, 'wgs1984_latitude'] = dict_temp['lat']
rche_df.at[row.Index, 'wgs1984_longitude'] = dict_temp['long']
# ^^ ^^^^^^^^^ <---- `at` instead of `loc` for faster assignment
# `row.Index` is the row's index, can also use `row[0]`
As you can see, using itertuples()
is almost the same syntax as using iterrows()
, yet it's over 6 times faster (you can verify it with a simple timeit
test).
2. to_dict()
is also an option
One drawback of itertuples()
is that whenever there's a space in a column label (e.g. 'Col A'
etc.), it will be mangled when converted into a namedtuple, so e.g. if 'address_chi'
was 'address chi'
, it will not be possible to access it via row.address chi
. One way to solve this problem is to convert the DataFrame into a dictionary and iterate over it.
Again, the syntax is almost the same as the one used for iterrows()
.
for index, row in rche_df.to_dict('index').items():
# ^^^^^^^^^^^^^^^^^^^^^^^^ <---- convert to a dict
if isinstance(row['wgs1984_latitude'], float):
target = row['address_chi']
dict_temp = geocoding(target)
rche_df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
rche_df.at[index, 'wgs1984_longitude'] = dict_temp['long']
This method is also about 6 times faster than iterrows()
but slightly slower than itertuples()
(also it's more memory-intensive than itertuples()
because it creates an explicit dictionary whereas itertuples()
creates a generator).
3. Iterate over only the necessary column/rows
The main bottleneck in the particular code in the OP (and in general, why a loop is sometimes necessary in a pandas dataframe) is that the function geocoding()
is not vectorized. So one way to make the code much faster is to call it only on the relevant column ('address_chi'
) and on the relevant rows (filtered using a boolean mask).
Note that creating the boolean mask was necessary only because there was an if-clause in the original code. If a conditional check was not needed, the boolean mask is not needed, so the necessary loop boils down to a single loop over a particular column (address_chi
).
# boolean mask to filter only the relevant rows
# this is analogous to if-clause in the loop in the OP
msk = [isinstance(row, float) for row in rche_df['wgs1984_latitude'].tolist()]
# call geocoding on the relevant values
# (filtered using the boolean mask built above)
# in the address_chi column
# and create a nested list
out = []
for target in rche_df.loc[msk, 'address_chi'].tolist():
dict_temp = geocoding(target)
out.append([dict_temp['lat'], dict_temp['long']])
# assign the nested list to the relevant rows of the original frame
rche_df.loc[msk, ['wgs1984_latitude', 'wgs1984_longitude']] = out
This method is about 40 times faster than iterrows()
.
A working example and performance test
def geocoding(x):
return {'lat': x*2, 'long': x*2}
def iterrows_(df):
for index, row in df.iterrows():
if isinstance(row.wgs1984_latitude, float):
target = row.address_chi
dict_temp = geocoding(target)
df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
df.at[index, 'wgs1984_longitude'] = dict_temp['long']
return df
def itertuples_(df):
for row in df.itertuples():
if isinstance(row.wgs1984_latitude, float):
target = row.address_chi
dict_temp = geocoding(target)
df.at[row.Index, 'wgs1984_latitude'] = dict_temp['lat']
df.at[row.Index, 'wgs1984_longitude'] = dict_temp['long']
return df
def to_dict_(df):
for index, row in df.to_dict('index').items():
if isinstance(row['wgs1984_latitude'], float):
target = row['address_chi']
dict_temp = geocoding(target)
df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
df.at[index, 'wgs1984_longitude'] = dict_temp['long']
return df
def boolean_mask_loop(df):
msk = [isinstance(row, float) for row in df['wgs1984_latitude'].tolist()]
out = []
for target in df.loc[msk, 'address_chi'].tolist():
dict_temp = geocoding(target)
out.append([dict_temp['lat'], dict_temp['long']])
df.loc[msk, ['wgs1984_latitude', 'wgs1984_longitude']] = out
return df
df = pd.DataFrame({'address_chi': range(20000)})
df['wgs1984_latitude'] = pd.Series([x if x%2 else float('nan') for x in df['address_chi'].tolist()], dtype=object)
%timeit itertuples_(df.copy())
# 248 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit boolean_mask_loop(df.copy())
# 38.7 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit to_dict_(df.copy())
# 289 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterrows_(df.copy())
# 1.57 s ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)