13

I have a Dataframe:

df = 
             A    B    C    D
DATA_DATE
20170103   5.0  3.0  NaN  NaN
20170104   NaN  NaN  NaN  1.0
20170105   1.0  NaN  2.0  3.0

And I have a series

s = 
DATA_DATE
20170103    4.0
20170104    0.0
20170105    2.2

I'd like to run an element-wise max() function and align s along the columns of df. In other words, I want to get

result = 
             A    B    C    D
DATA_DATE
20170103   5.0  4.0  NaN  NaN
20170104   NaN  NaN  NaN  1.0
20170105   2.2  NaN  2.2  3.0

What is the best way to do this? I've checked single column comparison and series to series comparison but haven't found an efficient way to run dataframe against a series.

Bonus: Not sure if the answer will be self-evident from above, but how to do it if I want to align s along the rows of df (assume dimensions match)?

Community
  • 1
  • 1
Zhang18
  • 4,800
  • 10
  • 50
  • 67

3 Answers3

10

This is called broadcasting and can be done as follows:

import numpy as np
np.maximum(df, s[:, None])
Out: 
             A    B    C    D
DATA_DATE                    
20170103   5.0  4.0  NaN  NaN
20170104   NaN  NaN  NaN  1.0
20170105   2.2  NaN  2.2  3.0

Here, s[:, None] will add a new axis to s. The same can be achieved by s[:, np.newaxis]. When you do this, they can be broadcast together because shapes (3, 4) and (3, 1) have a common element.

Note the difference between s and s[:, None]:

s.values
Out: array([ 4. ,  0. ,  2.2])

s[:, None]
Out: 
array([[ 4. ],
       [ 0. ],
       [ 2.2]])

s.shape
Out: (3,)

s[:, None].shape
Out: (3, 1)

An alternative would be:

df.mask(df.le(s, axis=0), s, axis=0)

Out: 
             A    B    C    D
DATA_DATE                    
20170103   5.0  4.0  NaN  NaN
20170104   NaN  NaN  NaN  1.0
20170105   2.2  NaN  2.2  3.0

This reads: Compare df and s. Where df is larger, use df, and otherwise use s.

ayhan
  • 70,170
  • 20
  • 182
  • 203
9

Data:

In [135]: df
Out[135]:
             A    B    C    D
DATA_DATE
20170103   5.0  3.0  NaN  NaN
20170104   NaN  NaN  NaN  1.0
20170105   1.0  NaN  2.0  3.0

In [136]: s
Out[136]:
20170103    4.0
20170104    0.0
20170105    2.2
Name: DATA_DATE, dtype: float64

Solution:

In [66]: df.clip_lower(s, axis=0)
C:\Users\Max\Anaconda4\lib\site-packages\pandas\core\ops.py:1247: RuntimeWarning: invalid value encountered in greater_equal
  result = op(x, y)
Out[66]:
             A    B    C    D
DATA_DATE
20170103   5.0  4.0  NaN  NaN
20170104   NaN  NaN  NaN  1.0
20170105   2.2  NaN  2.2  3.0

we can use the following hack in order to ged rid of the RuntimeWarning:

In [134]: df.fillna(np.inf).clip_lower(s, axis=0).replace(np.inf, np.nan)
Out[134]:
             A    B    C    D
DATA_DATE
20170103   5.0  4.0  NaN  NaN
20170104   NaN  NaN  NaN  1.0
20170105   2.2  NaN  2.2  3.0
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • 1
    Much cleaner. :) – ayhan May 16 '17 at 22:25
  • I'm trying to reproduce the answer but I'm getting a bunch of Nan values. Any idea what I could be doing wrong? – Moondra May 16 '17 at 22:41
  • To complete the answer, obviously to align on the row one just need to set `axis=1`. As @MaxU mentioned earlier, making sure index and column name match between the DataFrame and Series is very important for this to work. – Zhang18 May 17 '17 at 13:33
  • btw, is there a way to suppress or get around that RuntimeWarning presumably due to the NaNs in the original dataframe? – Zhang18 May 17 '17 at 13:52
  • 1
    @Zhang18, i think it's because of NaN's. You can try this dirty hack: `df.fillna(np.inf).clip_lower(s, axis=0).replace(np.inf, np.nan)` – MaxU - stand with Ukraine May 17 '17 at 13:59
  • 1
    Turns out as long as there is at least one column without `NaN`, then `clip_lower` doesn't throw an error on the entire operation. Feels like a bug. I'll try to submit an issue request on GitHub. – Zhang18 May 17 '17 at 14:22
  • Upon further investigation the error has little to do with `NaN`s. It is thrown the first time if there are `NaN`s but is mute upon running the same `clip` command for subsequent times. An issue has been opened on Github: https://github.com/pandas-dev/pandas/issues/16378 – Zhang18 May 17 '17 at 15:33
0

While there may be better solutions for your problem, I believe this should give you what you need:

for c in df.columns:
    df[c] = pd.concat([df[c], s], axis=1).max(axis=1)
happyhuman
  • 1,541
  • 1
  • 16
  • 30