31

I need to iterate over a pandas DataFrame and pass each row to a function (actually a class constructor) as **kwargs. Each row should therefore behave like a dictionary whose keys are the column names and whose values are that row's entries.

This works, but it performs very badly:

import pandas as pd


def myfunc(**kwargs):
    try:
        area = kwargs.get('length', 0) * kwargs.get('width', 0)
        return area
    except TypeError:
        return 'Error : length and width should be int or float'


df = pd.DataFrame({'length':[1,2,3], 'width':[10, 20, 30]})

for i in range(len(df)):
    print(myfunc(**df.iloc[i]))  # parenthesized form runs on Python 2.7 and 3

Any suggestions on how to make this faster? I have tried iterating with df.iterrows(), but I get the following error:

TypeError: myfunc() argument after ** must be a mapping, not tuple

I have also tried df.itertuples() and df.values, but either I am missing something or I would have to convert each tuple / np.array to a pd.Series or dict, which would also be slow. My constraint is that the script has to work with Python 2.7 and pandas 0.14.1.

Martin Thoma
Matina G
  • Try [DataFrame.iterrows](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html) – Itay Nov 14 '18 at 10:07
  • By far the slowest part of your code is printing the area. If I try it in Python 3 with 10,000 rows, I need 1.5 seconds with your variant (no printing), 0.9 seconds using iterrows(), and over 3 seconds if I print the areas – Florian H Nov 14 '18 at 10:13
  • Thank you for your suggestion. I have tried that, but I do not see how to access the column names for each element of the row. As for the print, I only wrote that to make the code executable; it is the iteration performance that matters – Matina G Nov 14 '18 at 10:14

3 Answers

64

One clean option is this:

for row_dict in df.to_dict(orient="records"):
    print(row_dict['column_name'])
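
Applied to the question's `myfunc`, this could look like the minimal sketch below. It assumes the question's sample data; `results` is an illustrative name, not part of the original code.

```python
import pandas as pd


def myfunc(**kwargs):
    # Same function as in the question: area from keyword arguments.
    try:
        return kwargs.get('length', 0) * kwargs.get('width', 0)
    except TypeError:
        return 'Error : length and width should be int or float'


df = pd.DataFrame({'length': [1, 2, 3], 'width': [10, 20, 30]})

# Each record is a plain dict keyed by column name, so it unpacks with **.
results = [myfunc(**row_dict) for row_dict in df.to_dict(orient='records')]
print(results)
```

The conversion to dicts happens once for the whole frame, so no per-row pd.Series is built inside the loop.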
avloss
  • this is the best answer – Iván Apr 23 '20 at 14:26
  • according to latest docs this is now `orient='records'`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html – Roy Shilkrot Jun 23 '20 at 05:22
  • Or if you want the index keys as well, use "index" instead of "records". You then also have to use `items()` to iterate over the keys / values – David Waterworth Nov 27 '20 at 21:54
  • This is also the best way to iterate over rows without having the issues of **1)** coercing data types like `.iterrows()` does, or **2)** renaming columns with invalid Python identifiers like `itertuples()` does. – jfaccioni Feb 25 '21 at 17:07
24

You can try:

for k, row in df.iterrows():
    myfunc(**row)

Here k is the dataframe index label and row is a pd.Series that behaves like a mapping keyed by column name, so it can be unpacked with ** and you can access any column with: row["my_column_name"]
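
The sketch below illustrates why unpacking the pair matters and what each row looks like. It uses a simplified `myfunc` and made-up sample data with one integer and one float column; `first`, `row_types`, `areas` and `length_dtype_in_row` are illustrative names.

```python
import pandas as pd


def myfunc(**kwargs):
    # Simplified version of the question's function.
    return kwargs.get('length', 0) * kwargs.get('width', 0)


# Mixed dtypes: 'length' is int64, 'width' is float64.
df = pd.DataFrame({'length': [1, 2, 3], 'width': [10.5, 20.0, 30.0]})

# iterrows() yields (index, Series) tuples. Applying ** to the tuple itself
# is what raises "argument after ** must be a mapping, not tuple".
first = next(df.iterrows())
index, row = first
row_types = (type(first).__name__, type(row).__name__)

# A Series provides keys() and item lookup, which is all ** unpacking needs.
areas = [myfunc(**row) for _, row in df.iterrows()]

# Each row Series has a single dtype, so the int column is upcast to float.
length_dtype_in_row = row.dtype
print(row_types, areas, length_dtype_in_row)
```

The upcast in the last step is the dtype coercion that makes `iterrows()` unsuitable when the original column types must be preserved.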

stellasia
  • Good solution for this case, but be aware that iterrows() hurts performance on a large dataset [see here](https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316) – Karn Kumar Nov 14 '18 at 10:55
  • That's true. I just answered in order to make iterrows() work, but @jpp's solution is probably better in terms of performance. – stellasia Nov 14 '18 at 10:58
  • In reality it's a `pd.Series`, not a `dict`. But it works, of course. – Diogo Santiago Feb 17 '23 at 15:08
1

Defining a separate function for this will be inefficient, because it forces row-wise calculations. It is more efficient to calculate a new series with vectorized operations, then iterate over that series:

df = pd.DataFrame({'length':[1,2,3,'test'], 'width':[10, 20, 30,'hello']})

df2 = df.iloc[:].apply(pd.to_numeric, errors='coerce')

error_str = 'Error : length and width should be int or float'
print(*(df2['length'] * df2['width']).fillna(error_str), sep='\n')

10.0
40.0
90.0
Error : length and width should be int or float
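
The `print(*..., sep='\n')` call above relies on Python 3 syntax. Since the question targets Python 2.7, here is a hedged variant of the same idea using a plain loop. It assumes a pandas version that provides `pd.to_numeric`; `numeric`, `area` and `lines` are illustrative names.

```python
from __future__ import print_function

import pandas as pd

df = pd.DataFrame({'length': [1, 2, 3, 'test'],
                   'width': [10, 20, 30, 'hello']})
error_str = 'Error : length and width should be int or float'

# Coerce each column to numbers; non-numeric entries become NaN.
numeric = df.apply(pd.to_numeric, errors='coerce')

# One vectorized multiplication for the whole frame; NaN marks invalid rows.
area = numeric['length'] * numeric['width']

# Substitute the error message only when producing the output.
lines = [error_str if pd.isnull(a) else a for a in area]
for line in lines:
    print(line)
```

Keeping `area` numeric until the final step avoids mixing strings into the series, so it stays usable for further calculations.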
jpp