3

I have a DataFrame with timestamped temperature and wind speed values, and a function to convert those into a "wind chill." I'm using iterrows to run the function on each row, and hoping to get a DataFrame out with a nifty "Wind Chill" column.

However, while it seems to work as it's going through, and has actually "worked" at least once, I can't seem to replicate it consistently. I feel like it's something I'm missing about the structure of DataFrames, in general, but I'm hoping someone can help.

In [28]: bigdf.head()
Out[28]: 


                           Day  Temperature  Wind Speed  Year
2003-03-01 06:00:00-05:00  1    30.27        5.27        2003
2003-03-01 07:00:00-05:00  1    30.21        4.83        2003
2003-03-01 08:00:00-05:00  1    31.81        6.09        2003
2003-03-01 09:00:00-05:00  1    34.04        6.61        2003
2003-03-01 10:00:00-05:00  1    35.31        6.97        2003

So I add a 'Wind Chill' column to bigdf and prepopulate with NaN.

In [29]: bigdf['Wind Chill'] = NaN

Then I try to iterate over the rows, to add the actual Wind Chills.

In [30]: for row_index, row in bigdf[:5].iterrows():
    ...:     row['Wind Chill'] = windchill(row['Temperature'], row['Wind Speed'])
    ...:     print row['Wind Chill']
    ...:
24.7945889994
25.1365267133
25.934114012
28.2194307516
29.5051046953

As you can see, the new values appear to be applied to the 'Wind Chill' column. Here's the windchill function, just in case that helps:

def windchill(temp, wind):
    if temp>50 or wind<=3:
        return temp
    else:
        return 35.74 + 0.6215*temp - 35.75*wind**0.16 + 0.4275*temp*wind**0.16

But, when I look at the DataFrame again, the NaN's are still there:

In [31]: bigdf.head()
Out[31]: 

                           Day  Temperature  Wind Speed  Year  Wind Chill
2003-03-01 06:00:00-05:00  1    30.27        5.27        2003  NaN
2003-03-01 07:00:00-05:00  1    30.21        4.83        2003  NaN
2003-03-01 08:00:00-05:00  1    31.81        6.09        2003  NaN
2003-03-01 09:00:00-05:00  1    34.04        6.61        2003  NaN
2003-03-01 10:00:00-05:00  1    35.31        6.97        2003  NaN

What's even weirder is that it has worked once or twice, and I can't tell what I did differently.

I must admit I'm not especially familiar with the inner workings of pandas, and get confused with indexing, etc., so I feel like I'm probably missing something very basic here (or doing this the hard way).

Thanks!

wimsy

3 Answers

9

You can use apply to do this:

In [11]: df.apply(lambda row: windchill(row['Temperature'], row['Wind Speed']),
                 axis=1)
Out[11]:
2003-03-01 06:00:00-05:00    24.794589
2003-03-01 07:00:00-05:00    25.136527
2003-03-01 08:00:00-05:00    25.934114
2003-03-01 09:00:00-05:00    28.219431
2003-03-01 10:00:00-05:00    29.505105

In [12]: df['Wind Chill'] = df.apply(lambda row: windchill(row['Temperature'], row['Wind Speed']),
                                    axis=1)

In [13]: df
Out[13]:
                           Day  Temperature  Wind Speed  Year  Wind Chill
2003-03-01 06:00:00-05:00    1        30.27        5.27  2003   24.794589
2003-03-01 07:00:00-05:00    1        30.21        4.83  2003   25.136527
2003-03-01 08:00:00-05:00    1        31.81        6.09  2003   25.934114
2003-03-01 09:00:00-05:00    1        34.04        6.61  2003   28.219431
2003-03-01 10:00:00-05:00    1        35.31        6.97  2003   29.505105

To expand on the reason for your confusion, I think it stems from the fact that the row variables are copies rather than views of the df, so changes don't propagate:

In [21]: for _, row in df.iterrows(): row['Day'] = 2

We see that it is making the change successfully to the copy, the row variable(s):

In [22]: row
Out[22]:
Day               2.00
Temperature      35.31
Wind Speed        6.97
Year           2003.00
Name: 2003-03-01 10:00:00-05:00

But they don't propagate to the DataFrame:

In [23]: df
Out[23]:
                           Day  Temperature  Wind Speed  Year
2003-03-01 06:00:00-05:00    1        30.27        5.27  2003
2003-03-01 07:00:00-05:00    1        30.21        4.83  2003
2003-03-01 08:00:00-05:00    1        31.81        6.09  2003
2003-03-01 09:00:00-05:00    1        34.04        6.61  2003
2003-03-01 10:00:00-05:00    1        35.31        6.97  2003

The following also leaves df unchanged:

In [24]: row = df.ix[0]  # also a copy

In [25]: row['Day'] = 2

Whereas if we do take a view, we'll see the change in df:

In [26]: row = df.ix[2:3]  # this one's a view

In [27]: row['Day'] = 3

See Returning a view versus a copy (in the docs).
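
If you do want to keep the explicit loop, one option is to write through the DataFrame itself rather than through the row copy. This is only a minimal sketch: it rebuilds a small stand-in for bigdf from the first three rows of the question (with a default integer index instead of the timestamps), and it assumes a pandas version that provides the .at accessor. The asker's scalar windchill function is reused unchanged.

```python
import numpy as np
import pandas as pd

def windchill(temp, wind):
    # Scalar wind chill function from the question
    if temp > 50 or wind <= 3:
        return temp
    return 35.74 + 0.6215*temp - 35.75*wind**0.16 + 0.4275*temp*wind**0.16

# Stand-in data (assumption): first three rows of the question's DataFrame
df = pd.DataFrame({'Temperature': [30.27, 30.21, 31.81],
                   'Wind Speed': [5.27, 4.83, 6.09]})
df['Wind Chill'] = np.nan

for row_index, row in df.iterrows():
    # row is a copy, so assign by label on df instead of on row
    df.at[row_index, 'Wind Chill'] = windchill(row['Temperature'],
                                                row['Wind Speed'])
```

Here row is only read from; every write goes to df by index label, so it does not matter whether row is a copy or a view. The apply version above is still shorter.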

Andy Hayden
  • I suspected it had something to do with copies vs views, but I was thinking about it the opposite way and really confusing myself. Thanks for the detailed answer! – wimsy Apr 13 '13 at 13:58
  • I had a similar problem, with a similar solution, but here's the weird part: It WORKED on an older installation somehow, but not with newer versions of Pandas on other machines. That REALLY drove me nuts. So in case anyone else starts pulling their hair out on a similar problem, I thought I'd pass this along – ViennaMike Dec 29 '14 at 04:12
  • @ViennaMike are you saying the above worked on newer or older pandas? There are a few edge cases in pandas' apply which have been tweaked over last few releases so this could be one of them! – Andy Hayden Dec 29 '14 at 14:35
  • @AndyHayden, yes, I was using an older version of Anaconda which had a version 11, I believe, edition of Pandas, and was using iterrows, and the way I had it coded worked fine, with references to updates in the original data frame via row. But that didn't work (apparently row referenced a copy, not the original) when I tried on two later versions. Fixed it in my code with direct .loc references to the original dataframe. – ViennaMike Dec 29 '14 at 16:38
1

Try it with:

bigdf['Wind Chill'] = bigdf.apply(lambda x: windchill(x['Temperature'], x['Wind Speed']), axis=1)

for the whole DataFrame at once using your simple windchill function.

eumiro
1

I would say that you don't need any explicit loop. The following hopefully does what you want:

import pandas as pd

bigdf = pd.DataFrame({'Temperature': [30.27, 30.21, 31.81], 'Wind Speed': [5.27, 4.83, 6.09]})

def windchill(temp, wind):
    "compute the wind chill given two pandas series temp and wind"
    tomodify = (temp<=50) & (wind>3)  # check which values need to be modified
    t = temp.copy()  # create a new series
    # change only the values that need modification
    # (parentheses let the expression continue onto the next line)
    t[tomodify] = (35.74 + 0.6215*temp[tomodify] - 35.75*wind[tomodify]**0.16 +
                   0.4275*temp[tomodify]*wind[tomodify]**0.16)
    return t

bigdf['Wind Chill'] = windchill(bigdf['Temperature'], bigdf['Wind Speed'])

bigdf

   Temperature  Wind Speed  Wind Chill
0        30.27        5.27   24.794589
1        30.21        4.83   25.136527
2        31.81        6.09   25.934114

P.S.: This implementation of windchill also works with numpy arrays.
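
As a variation on the same idea, the masking can also be written with numpy.where. This is a sketch rather than a drop-in replacement: windchill_where is a hypothetical name introduced here, and it swaps the boolean-mask assignment for np.where, which evaluates the formula for every element and then picks per element between the formula result and the original temperature.

```python
import numpy as np
import pandas as pd

def windchill_where(temp, wind):
    """Hypothetical helper: vectorized wind chill using np.where."""
    temp = np.asarray(temp, dtype=float)
    wind = np.asarray(wind, dtype=float)
    # The formula is computed for all elements; np.where keeps it only
    # where the wind chill condition holds
    chilled = 35.74 + 0.6215*temp - 35.75*wind**0.16 + 0.4275*temp*wind**0.16
    return np.where((temp <= 50) & (wind > 3), chilled, temp)

bigdf = pd.DataFrame({'Temperature': [30.27, 30.21, 31.81],
                      'Wind Speed': [5.27, 4.83, 6.09]})
bigdf['Wind Chill'] = windchill_where(bigdf['Temperature'], bigdf['Wind Speed'])
```

The function returns a plain numpy array, so assigning it to a column relies on the values being in the same order as the DataFrame's rows.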

Francesco Montesano
  • Thanks. My googling revealed that reworking windchill was another option but I was really trying to figure what I was doing wrong the way it was. :) – wimsy Apr 13 '13 at 14:00
  • Gotcha. Good that you found the explanation – Francesco Montesano Apr 15 '13 at 08:02