Iterrows performance

Question

I'm working with Python 2.7 and pandas (version 0.18.1) data frames. I need to modify one column of a data frame based on several other columns in the same data frame.

My data frame is named data. A sample of it looks like this:

+---+---+----+----+---+---------+---+----+----+---+----------+
| a | b | c  | d  | e |    f    | g | h  | i  | j | discount |
+---+---+----+----+---+---------+---+----+----+---+----------+
| 0 |   |    |    |   | 65497.6 |   |    |    |   |        0 |
| 0 |   |    |    |   | 73882.8 |   |    |    |   |        0 |
| 0 |   |    |    |   | 88588   |   | 22 |    |   |        0 |
| 0 |   |    |    |   | 106480  |   | 20 | 10 |   |        0 |
| 0 |   |    |    |   | 52500   |   |    |    |   |        0 |
| 0 |   | 20 | 10 |   | 22997.5 |   |    |    |   |        0 |
|   |   |    |    |   |         |   |    |    |   |        0 |
| 0 |   |    | 20 |   | 0       |   |    |    |   |        0 |
| 0 |   |    |    |   | 10520   |   |    |    |   |        0 |
+---+---+----+----+---+---------+---+----+----+---+----------+

My code is:

columns1 = ['a', 'b', 'c', 'd', 'e']
columns2 = ['f', 'g', 'h', 'i', 'j']
data['discount'] = 0
for i, row in data.iterrows():
    a = 0
    b = 0
    for col1 in columns1:
        value = row[col1]
        if value > 0:
            a = value
            break
    for col2 in columns2:
        value = row[col2]
        if value > 0:
            b = value
            break
    if a != 0 and b != 0:
        data.loc[i, 'discount'] = abs(a - b)

Done this way, it takes a lot of time and a lot of memory on a large dataset. I have 700MB of data; processing it uses more than 120GB of RAM, and after approximately 10 hours the process fails with a MemoryError.

According to https://stackoverflow.com/a/24871316, I should not use iterrows like this. How can I write this code more efficiently?

Please let me know the reason for downvoting my question, so that I can learn.

Please provide a sample of the data so we can experiment. – piRSquared, Sep 09 '16 at 19:17

score 0 · Accepted Answer · answered Sep 09 '16 at 20:49

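Assuming your empty cells are NaN values, this gives you the first positive value of each row for the group of columns you are interested in (df here is the data frame called data in the question). df[df>0] turns every value that is not greater than 0 into NaN, so the first non-NA value left in a row is its first positive one: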

df[df>0][columns1].bfill(axis=1).iloc[:,0]

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5    20.0
6     NaN
7    20.0
8     NaN
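
To make the two steps of that expression easier to follow, here is a minimal sketch on a hypothetical three-row frame (the values are made up for illustration and are not the asker's data). It assumes all columns are numeric with NaN for empty cells:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only for illustrating the mechanics.
df = pd.DataFrame({
    "a": [0.0, np.nan, 0.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [20.0, 7.0, np.nan],
})

# Step 1: boolean-mask indexing keeps values > 0 and replaces the rest with NaN.
positive = df[df > 0]

# Step 2: bfill(axis=1) fills each NaN with the next valid value to its right,
# so the leftmost column ends up holding the first positive value of the row.
first_positive = positive.bfill(axis=1).iloc[:, 0]

print(first_positive)
```

A row with no positive value at all stays NaN, which is what later lets fillna(0) reproduce the "leave discount at 0" branch of the original loop.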

Thus, this will give you the abs(a-b) you're searching for:

res = (df[df>0][columns1].bfill(axis=1).iloc[:,0]
      -df[df>0][columns2].bfill(axis=1).iloc[:,0]).abs()
res

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
5    22977.5
6        NaN
7        NaN
8        NaN

You can either combine it with your initialized discount column:

res.combine_first(df.discount)

or fill the blanks:

res.fillna(0)
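
Putting the pieces together, the following is a sketch that rebuilds the sample table from the question and assigns the result to the discount column. It assumes the blank cells are NaN; the helper name first_positive is hypothetical, and selecting the columns before masking (rather than masking the whole frame first) is a small variation on the expression above, not something the answer prescribes:

```python
import numpy as np
import pandas as pd

nan = np.nan
columns1 = ["a", "b", "c", "d", "e"]
columns2 = ["f", "g", "h", "i", "j"]

# Sample rows from the question, with blank cells assumed to be NaN.
df = pd.DataFrame(
    [
        [0, nan, nan, nan, nan, 65497.6, nan, nan, nan, nan],
        [0, nan, nan, nan, nan, 73882.8, nan, nan, nan, nan],
        [0, nan, nan, nan, nan, 88588, nan, 22, nan, nan],
        [0, nan, nan, nan, nan, 106480, nan, 20, 10, nan],
        [0, nan, nan, nan, nan, 52500, nan, nan, nan, nan],
        [0, nan, 20, 10, nan, 22997.5, nan, nan, nan, nan],
        [nan] * 10,
        [0, nan, nan, 20, nan, 0, nan, nan, nan, nan],
        [0, nan, nan, nan, nan, 10520, nan, nan, nan, nan],
    ],
    columns=columns1 + columns2,
)


def first_positive(frame, cols):
    """Return, per row, the first value > 0 among cols (NaN if there is none)."""
    sub = frame[cols]
    return sub.where(sub > 0).bfill(axis=1).iloc[:, 0]


# abs(a - b) where both groups have a positive value, 0 everywhere else.
df["discount"] = (
    (first_positive(df, columns1) - first_positive(df, columns2)).abs().fillna(0)
)

print(df["discount"])
```

On this sample only row 5 gets a non-zero discount, matching the res output shown earlier. Restricting the mask to the ten relevant columns also avoids building a masked copy of every column in the frame, which may matter on a 700MB dataset.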
