65

I am trying to iterate over the rows of a Python Pandas dataframe. Within each row of the dataframe, I am trying to to refer to each value along a row by its column name.

Here is what I have:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
print df
          A         B         C         D
0  0.351741  0.186022  0.238705  0.081457
1  0.950817  0.665594  0.671151  0.730102
2  0.727996  0.442725  0.658816  0.003515
3  0.155604  0.567044  0.943466  0.666576
4  0.056922  0.751562  0.135624  0.597252
5  0.577770  0.995546  0.984923  0.123392
6  0.121061  0.490894  0.134702  0.358296
7  0.895856  0.617628  0.722529  0.794110
8  0.611006  0.328815  0.395859  0.507364
9  0.616169  0.527488  0.186614  0.278792

I used this approach to iterate, but it is only giving me part of the solution - after selecting a row in each iteration, how do I access row elements by their column name?

Here is what I am trying to do:

for row in df.iterrows():
    print row.loc[0,'A']
    print row.A
    print row.index()

My understanding is that the row is a Pandas series. But I have no way to index into the Series.

Is it possible to use column names while simultaneously iterating over rows?

Community
  • 1
  • 1
edesz
  • 11,756
  • 22
  • 75
  • 123
  • 4
    `row` in your example is not a Series, it should be a tuple. But if you do `for idx, row in df.iterrows()`, `row['A']` should work fine? – ayhan Apr 25 '17 at 20:06
  • That's what I was missing! Thanks. – edesz Apr 25 '17 at 20:09
  • 1
    Most numeric operations with pandas can be vectorized - this means they are much faster than conventional iteration. OTOH, some operations (such as string and regex) are inherently hard to vectorize. This this case, it is important to understand _how_ to loop over your data. More more information on when and how looping over your data is to be done, please read [For loops with Pandas - When should I care?](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care/54028200#54028200). – cs95 Jan 04 '19 at 10:17

5 Answers5

94

I also like itertuples()

for row in df.itertuples():
    print(row.A)
    print(row.Index)

since row is a named tuples, if you meant to access values on each row this should be MUCH faster

speed run :

df = pd.DataFrame([x for x in range(1000*1000)], columns=['A'])
st=time.time()
for index, row in df.iterrows():
    row.A
print(time.time()-st)
45.05799984931946

st=time.time()
for row in df.itertuples():
    row.A
print(time.time() - st)
0.48400020599365234
Steven G
  • 16,244
  • 8
  • 53
  • 77
  • Thanks! I think this is actually what I had in mind (but could not remember). It is much more practical (since there is no need for `idx`, like having to `enumerate` a list). Since I asked for for `iterrows()`, I'll go with that answer. But this is what I would have used had I remembered. – edesz Apr 25 '17 at 20:13
  • 2
    `print(row.Index)` results in: `AttributeError: 'tuple' object has no attribute 'Index'` – kiltek Jun 13 '17 at 07:35
  • @kiltek did you use `itertuples(index=False)` ? if not, i would need some code to figure out whats wrong – Steven G Jun 13 '17 at 12:08
  • 2
    @WR This should be the accepted answer. It is [up to 50x faster](https://github.com/mm-mansour/Fast-Pandas). – bbennett36 Apr 08 '18 at 18:04
  • As I understand it, if the number of columns is greater than 255, the tuples returned are not named. Is there any way to overwrite this and produce named tuples for ~3000 columns? I want to eventually grab those column names that meet a condition. @Steven G – abbas786 Sep 18 '19 at 19:35
  • @abbas786 sadly no, the 255 column limit is a `namedtuple` limit. not a pandas limit. – Steven G Sep 18 '19 at 21:40
35

The item from iterrows() is not a Series, but a tuple of (index, Series), so you can unpack the tuple in the for loop like so:

for (idx, row) in df.iterrows():
    print(row.loc['A'])
    print(row.A)
    print(row.index)

#0.890618586836
#0.890618586836
#Index(['A', 'B', 'C', 'D'], dtype='object')
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • @StevenG Yeah. That's what I meant to say. I guess it's clearer if we say (index, Series). – Psidom Apr 25 '17 at 20:10
  • use itertuples() as suggested in the second answer...If you are working with a large dataframe intertuples is a lot faster – Megha Feb 23 '18 at 01:49
31

How to iterate efficiently

If you really have to iterate a Pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.

In short:

  • As a general rule, use df.itertuples(name=None). In particular, when you have a fixed number columns and less than 255 columns. See point (3)
  • Otherwise, use df.itertuples() except if your columns have special characters such as spaces or '-'. See point (2)
  • It is possible to use itertuples() even if your dataframe has strange columns by using the last example. See point (4)
  • Only use iterrows() if you cannot use the previous solutions. See point (1)

Different methods to iterate over rows in a Pandas dataframe:

Generate a random dataframe with a million rows and 4 columns:

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)
  1. The usual iterrows() is convenient, but damn slow:

    start_time = time.clock()
    result = 0
    for _, row in df.iterrows():
        result += max(row['B'], row['C'])
    
    total_elapsed_time = round(time.clock() - start_time, 2)
    print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  2. The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name).:

    start_time = time.clock()
    result = 0
    for row in df.itertuples(index=False):
        result += max(row.B, row.C)
    
    total_elapsed_time = round(time.clock() - start_time, 2)
    print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  3. The default itertuples() using name=None is even faster but not really convenient as you have to define a variable per column.

    start_time = time.clock()
    result = 0
    for(_, col1, col2, col3, col4) in df.itertuples(name=None):
        result += max(col2, col3)
    
    total_elapsed_time = round(time.clock() - start_time, 2)
    print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  4. Finally, the named itertuples() is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange.

    start_time = time.clock()
    result = 0
    for row in df.itertuples(index=False):
        result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])
    
    total_elapsed_time = round(time.clock() - start_time, 2)
    print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
    

Output:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

This article is a very interesting comparison between iterrows and itertuples

Janosh
  • 3,392
  • 2
  • 27
  • 35
Romain Capron
  • 1,565
  • 1
  • 18
  • 23
2

This was not as straightforward as I would have hoped. You need to use enumerate to keep track of how many columns you have. Then use that counter to look up the name of the column. The accepted answer does not show you how to access the column names dynamically.

for row in df.itertuples(index=False, name=None):
    for k,v in enumerate(row):
        print("column: {0}".format(df.columns.values[k]))
        print("value: {0}".format(v)
devdrc
  • 1,853
  • 16
  • 21
-1
for i in range(1,len(na_rm.columns)):
           print ("column name:", na_rm.columns[i])

Output :

column name: seretide_price
column name: symbicort_mkt_shr
column name: symbicort_price
geisterfurz007
  • 5,292
  • 5
  • 33
  • 54
Avik Das
  • 1
  • 1