7

Let's say we have a dataframe with columns A, B and C:

df = pd.DataFrame(columns =('A','B','C'), index=range(1))

The columns holds three rows of numeric values:

0     A     B      C
1    2.1   1.8    1.6
2    2.01  1.81   1.58
3    1.9   1.84   1.52

How does one loop through every row from 1 to 3 and then execute an if statement including add some variables:

if B1 > 1.5
    calc_temp   = A1*10
    calc_temp01 = C1*-10
if B2 > 1.5 
    calc_temp   = A2*10
    calc_temp01 = C2*-10
if B3 >1.5
    calc_temp   = A3*10
    calc_temp01 = C3*-10

Is above even possible? It has to know a range of some sorts i.e. full range dataset number with some kind of counter, yes? The if statement should refer to that specific row.

Bondeaux
  • 174
  • 1
  • 3
  • 10
  • You can use `for row in df.iterrows():` – cs95 Aug 14 '17 at 08:24
  • Since you are replacing the already calculated values of `calc_temp` and `calc_temp01`, wouldn't the final values of `calc_temp` and `calc_temp01` be dependent on the values of `A` and `C` corresponding to the last occurrence of B greater than `1.5` – Clock Slave Aug 14 '17 at 10:48
  • Does this answer your question? [What is the most efficient way to loop through dataframes with pandas?](https://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas) – questionto42 Jul 08 '22 at 17:34

2 Answers2

17

I think you need iterrows:

for i, row in df.iterrows():
    if row['B'] > 1.5:
        calc_temp   = row['A'] *10
        calc_temp01 = row['C'] *-10
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 2
    If you're accessing the row in a loop, why call `.loc` again? May as well use `for i in range(len(df)):` – cs95 Aug 14 '17 at 08:27
  • 1
    @cᴏʟᴅsᴘᴇᴇᴅ - sometimes overcomplicated it, thank you for succestion. – jezrael Aug 14 '17 at 08:37
  • @jezrael - Splendid, it worked. What if we were to plot calc_temp through the iteration while the plot keeps updating. How would one achieve that? – Bondeaux Aug 14 '17 at 10:14
  • @Bondeaux - Hard question, if need plotting. Maybe the best is create new question with sample data, desired output and what you try. Thank you. – jezrael Aug 14 '17 at 10:16
9

How to iterate efficiently

If you really have to iterate a Pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.

In short:

  • As a general rule, use df.itertuples(name=None). In particular, when you have a fixed number columns and less than 255 columns. See point (3)
  • Otherwise, use df.itertuples() except if your columns have special characters such as spaces or '-'. See point (2)
  • It is possible to use itertuples() even if your dataframe has strange columns by using the last example. See point (4)
  • Only use iterrows() if you cannot the previous solutions. See point (1)

Different methods to iterate over rows in a Pandas dataframe:

Generate a random dataframe with a million rows and 4 columns:

    df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
    print(df)

1) The usual iterrows() is convenient, but damn slow:

start_time = time.clock()
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.clock() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

2) The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name).:

start_time = time.clock()
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.clock() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

3) The default itertuples() using name=None is even faster but not really convenient as you have to define a variable per column.

start_time = time.clock()
result = 0
for(_, col1, col2, col3, col4) in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.clock() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

4) Finally, the named itertuples() is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange.

start_time = time.clock()
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.clock() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))

Output:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

This article is a very interesting comparison between iterrows and itertuples

Romain Capron
  • 1,565
  • 1
  • 18
  • 23
  • Even three times faster again than itertuples is `for r in zip(*t.to_dict("list").values()):`, see [What is the most efficient way to loop through dataframes with pandas?](https://stackoverflow.com/a/34311080/11154841). – questionto42 Jul 08 '22 at 17:33