Iterating over rows in dataframe: Why not: "for i in workingDF.index:"?

Question

First, let me say: I know I shouldn't be iterating over a dataframe, per the usual advice in other posts.

However, for my application I don't think I have a better option, although I am relatively new to Python and pandas and may simply lack the knowledge. As I iterate over the rows, I need to access an adjacent row's data, which I can't figure out how to do with vectorization or a list comprehension.

Which leaves me with iteration. I have seen several posts on iterrows() and itertuples(), which will work. Before I found out about these, though, I tried:

for i in workingDF.index:
    if i==0:
        list2Add = ['NaN']
        compareItem = workingDF.at[0,'name']
    else:
        if (workingDF.at[i,'name'] != compareItem):
            list2Add.append('NaN')
            compareItem = workingDF.at[i,'name']
        else:
            currentValue = workingDF.at[i,'value']
            yesterdayValue = workingDF.at[(i-1),'value']
            r =  currentValue - yesterdayValue
            list2Add.append(r)

Anyway, my naive code seemed to work as intended (so far). So the question is: is there some inherent reason not to use "for i in workingDF.index" in favor of the standard iterrows() and itertuples()? (Presumably there must be, since those are the "recommended" methods...)

Thanks in advance. Jim

EDIT: An example was requested. In this example each row contains a name, testNumber, and score. The example code creates a new column labelled "change" which represents the change of the current score compared to the most recent prior score. Example code:

import pandas as pd
def createDF():
    # list of name, testNo, score 
    nme2 = ["bob", "bob", "bob", "bob", "jim", "jim", "jim" ,"jim" ,"ed" ,"ed" ,"ed" ,"ed"] 
    tstNo2 = [1,2,3,4,1,2,3,4,1,2,3,4]
    scr2 = [82, 81, 80, 79,93,94,95,98,78,85,90,92] 
    # dictionary of lists  
    dict = {'name': nme2, 'TestNo': tstNo2, 'score': scr2}  
    workingDF = pd.DataFrame(dict) 
    return workingDF
def addChangeColumn(workingDF):
    """
    returns a Dataframe object with an added column named 
       "change" which represents the change in score compared to 
       most recent prior test result
    """
    for i in workingDF.index:
        if i==0:
            list2Add = ['NaN']
            compareItem = workingDF.at[0,'name']
        else:
            if (workingDF.at[i,'name'] != compareItem):
                list2Add.append('NaN')
                compareItem = workingDF.at[i,'name']
            else:
                currentScore = workingDF.at[i,'score']
                yesterdayScore = workingDF.at[(i-1),'score']
                r =  currentScore - yesterdayScore
                list2Add.append(r)

    modifiedDF = pd.concat([workingDF, pd.Series(list2Add, name ='change')], axis=1)
    return(modifiedDF)
if __name__ == '__main__':
    myDF = createDF()
    print('myDF is:')
    print(myDF)
    print()
    newDF = addChangeColumn(myDF)
    print('newDF is:')
    print(newDF)

Example Output:

myDF is:
   name  TestNo  score
0   bob       1     82
1   bob       2     81
2   bob       3     80
3   bob       4     79
4   jim       1     93
5   jim       2     94
6   jim       3     95
7   jim       4     98
8    ed       1     78
9    ed       2     85
10   ed       3     90
11   ed       4     92

newDF is:
   name  TestNo  score change
0   bob       1     82    NaN
1   bob       2     81     -1
2   bob       3     80     -1
3   bob       4     79     -1
4   jim       1     93    NaN
5   jim       2     94      1
6   jim       3     95      1
7   jim       4     98      3
8    ed       1     78    NaN
9    ed       2     85      7
10   ed       3     90      5
11   ed       4     92      2

Thank you.

Could you provide sample data and expected output so we can figure out possible alternative solutions? – YOLO, Jan 05 '20 at 06:16

score 1 · Accepted Answer · answered Jan 05 '20 at 04:20


In short, the answer is the performance benefit of using iterrows. This post could better explain the differences between the various options.
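For reference, here is a minimal sketch of the three row-iteration styles under discussion, applied to a small frame shaped like the one in the question. It only checks that they produce the same result and makes no timing claims. The function names are hypothetical, and the index-based version assumes a default 0..n-1 index so that i - 1 is the label of the prior row.

```python
import pandas as pd

# Small frame shaped like the question's example (values are illustrative).
df = pd.DataFrame({
    "name": ["bob", "bob", "jim", "jim"],
    "score": [82, 81, 93, 94],
})

NAN = float("nan")

def change_via_index(frame):
    """Loop over index labels; assumes a default 0..n-1 index."""
    out = []
    for i in frame.index:
        if i == 0 or frame.at[i, "name"] != frame.at[i - 1, "name"]:
            out.append(NAN)  # first row of a new name has no prior score
        else:
            out.append(frame.at[i, "score"] - frame.at[i - 1, "score"])
    return out

def change_via_iterrows(frame):
    """Loop with iterrows(), remembering the previous row as a Series."""
    out, prev = [], None
    for _, row in frame.iterrows():
        if prev is None or row["name"] != prev["name"]:
            out.append(NAN)
        else:
            out.append(row["score"] - prev["score"])
        prev = row
    return out

def change_via_itertuples(frame):
    """Loop with itertuples(), remembering the previous row as a namedtuple."""
    out, prev = [], None
    for row in frame.itertuples(index=False):
        if prev is None or row.name != prev.name:
            out.append(NAN)
        else:
            out.append(row.score - prev.score)
        prev = row
    return out

print(change_via_index(df))
print(change_via_iterrows(df))
print(change_via_itertuples(df))
```

Note that the iterrows() and itertuples() versions carry the previous row along explicitly, so they do not depend on the index labels being consecutive integers.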

Roshan Santhosh


Very helpful, thank you. My example is not explicitly listed but probably like the slowest example (4x slower than iterrows)... – gymshoe Jan 06 '20 at 00:37

score 0 · Answer 2 · answered Jan 16 '20 at 21:15

My problem was that I wanted to create a new column holding the difference between a value in the current row and a value in a prior row, without using iteration.

I think the more "pandas-esque" way of doing this (without iteration) would be to use DataFrame.shift() to create a new column containing the prior row's data shifted into the current row, so that all the necessary data is available in the current row.
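A minimal sketch of that idea, using the data from the question's example. Grouping by name before shifting is an assumption added here so that the first row of each name gets NaN, as in the question's output. The column name prior_score is hypothetical.

```python
import pandas as pd

# Same data as the example in the question.
df = pd.DataFrame({
    "name": ["bob"] * 4 + ["jim"] * 4 + ["ed"] * 4,
    "TestNo": [1, 2, 3, 4] * 3,
    "score": [82, 81, 80, 79, 93, 94, 95, 98, 78, 85, 90, 92],
})

# Shift each name's scores down by one row, so every row also holds the
# prior score for the same name (NaN on the first row of each name).
df["prior_score"] = df.groupby("name")["score"].shift(1)

# With both values on the same row, the change is a plain column subtraction.
df["change"] = df["score"] - df["prior_score"]

print(df)
```

The change column here holds real NaN values (so it becomes a float column) rather than the string 'NaN' used in the loop version. If the helper column is not needed, df.groupby("name")["score"].diff() should give the same change column in one step.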
