
I have a big pandas dataframe. It has thousands of columns and over a million rows. I want to calculate the difference between the max value and the min value row-wise. Keep in mind that there are many NaN values and some rows are all NaN values (but I still want to keep them!).

I wrote the following code. It works but it's time consuming:

totTime = []
for index, row in date.iterrows():
    myRow = row.dropna()
    if len(myRow):
        tt = max(myRow) - min(myRow)
    else:
        tt = None
    totTime.append(tt)

Is there any way to optimize it? I tried with the following code but I get an error when it encounters all NaN rows:

tt = lambda x: max(x.dropna()) - min(x.dropna())
totTime = date.apply(tt, axis=1)
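As far as I can tell, the error comes from calling `max()` on an empty sequence: after `dropna()`, an all-NaN row has nothing left. A minimal reproduction with a toy Series:

```python
import numpy as np
import pandas as pd

# An all-NaN row becomes an empty Series after dropna()
all_nan = pd.Series([np.nan, np.nan]).dropna()
print(len(all_nan))  # 0
# max(all_nan) would raise: ValueError: max() arg is an empty sequence
```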

Any suggestions will be appreciated!

YamiOmar88
    IIUC you can just do `date.max(axis=1) - date.min(axis=1)` – EdChum Sep 16 '16 at 09:26
    You should avoid using loops and `apply` and search for a vectorised method like I suggested, most pandas ops will handle `NaN` gracefully you then need to filter out either before or after the operation – EdChum Sep 16 '16 at 09:35

2 Answers


It is usually a bad idea to use a Python for loop to iterate over a large pandas.DataFrame or numpy.ndarray. You should instead use the built-in methods they provide, as those are optimized and in many cases not actually written in Python but in a compiled language. In your case, use the methods pandas.DataFrame.max and pandas.DataFrame.min, which both take a skipna option to skip NaN values in your DataFrame without the need to drop them manually. Furthermore, you can choose an axis to reduce along: specify axis=1 to take the maximum/minimum across the columns, i.e. row-wise.

This adds up to something similar to what @EdChum just mentioned in the comments:

date.max(axis=1, skipna=True) - date.min(axis=1, skipna=True)
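A minimal sketch of this vectorised approach on a toy frame (here named `df`), showing that all-NaN rows simply produce NaN rather than raising an error:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1.0, 5.0, np.nan],          # range: 5.0 - 1.0 = 4.0
    [np.nan, np.nan, np.nan],    # all-NaN row: result is NaN, no error
    [2.0, 2.0, 2.0],             # range: 0.0
])

# Row-wise max minus min; skipna=True (the default) ignores NaN values
tot_time = df.max(axis=1, skipna=True) - df.min(axis=1, skipna=True)
print(tot_time.tolist())  # [4.0, nan, 0.0]
```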
jotasi
    `skipna` is default `True` so it's unnecessary: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html – EdChum Sep 16 '16 at 09:31
  • Oh good to know, thanks. I usually prefer explicit over implicit and will keep it thus. – jotasi Sep 16 '16 at 09:32

I had the same problem with iterating. Two points:

  1. Why don't you replace the NaN values with 0? You can do it with `df.replace([np.inf, np.nan], 0)`. It replaces inf and NaN values (note that passing the strings `'inf'` and `'nan'` would only match string-typed cells, not actual floats).
  2. Take a look at this. Maybe it helps: I had a similar question about how to optimize a loop that calculates the difference between each row and the previous one.
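A minimal sketch of the replacement idea from point 1, on a toy frame `df` (as the question author notes below, this is only appropriate when 0 is not itself a valid measurement):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.inf, np.nan], "b": [2.0, 3.0, np.nan]})

# Replace actual inf/NaN floats with 0. Use np.inf / np.nan rather than
# the strings 'inf'/'nan', which would only match string-typed cells.
cleaned = df.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned["a"].tolist())  # [1.0, 0.0, 0.0]
```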
Juliana Rivera
  • Hi Juliana. I do not replace NaN values with 0 because NaN in my case means missing, and 0 is a possible measurement. If I were to replace NaN with 0, my data would no longer represent my case. I also agree with the post you shared. Avoiding for loops is the key. – YamiOmar88 Sep 30 '16 at 15:02