1

I have a data-frame df comprising of an identifier column and 4 columns of dates. Please see below for the head of the data-frame.

   Identifier                    A \
0       12253  1989-11-09 00:00:00   
1       11413  1990-09-03 00:00:00   
2       12337  1977-09-07 00:00:00   
3       10610  1994-08-24 00:00:00   
4        7493  1993-08-22 00:00:00   

                      B                              C  \
0   2012-03-01 00:00:00            2015-04-01 00:00:00   
1   2009-08-16 00:00:00            2015-05-18 00:00:00   
2   1977-09-07 00:00:00            1977-09-07 00:00:00   
3   2009-01-09 00:00:00            2015-03-01 00:00:00   
4   2002-06-03 00:00:00            2015-02-16 00:00:00   

                      D  \
0   2012-01-03 00:00:00   
1   2015-05-18 00:00:00   
2   1977-09-07 00:00:00   
3   2015-03-01 00:00:00   
4   2015-02-16 00:00:00 

I am trying to compare the date columns against each other and where if A>=B>=C>=D then in a new column check True should be returned else False.

So the resulting data-frame would look like:

   Identifier                    A \
0       12253  1989-11-09 00:00:00   
1       11413  1990-09-03 00:00:00   
2       12337  1977-09-07 00:00:00   
3       10610  1994-08-24 00:00:00   
4        7493  1993-08-22 00:00:00   

                      B                              C  \
0   2012-03-01 00:00:00            2015-04-01 00:00:00   
1   2009-08-16 00:00:00            2015-05-18 00:00:00   
2   1977-09-07 00:00:00            1977-09-07 00:00:00   
3   2009-01-09 00:00:00            2015-03-01 00:00:00   
4   2002-06-03 00:00:00            2015-02-16 00:00:00   

                      D       Check\
0   2012-01-03 00:00:00       False
1   2015-05-18 00:00:00       True
2   1977-09-07 00:00:00       True 
3   2015-03-01 00:00:00       True
4   2015-02-16 00:00:00       True

I have tried

df['Check'] = np.where(df['A'] >= df['B']>= df['C']>= df['D'], 'True', 'False')

But can't get this to work. What can I try next?

halfer
  • 19,824
  • 17
  • 99
  • 186
Stacey
  • 4,825
  • 17
  • 58
  • 99

3 Answers3

2

Use DataFrame.eval:

df['Check'] = df.eval('A <= B <= C <= D')
print (df)
   Identifier          A          B          C          D  Check
0       12253 1989-11-09 2012-03-01 2015-04-01 2012-01-03  False
1       11413 1990-09-03 2009-08-16 2015-05-18 2015-05-18   True
2       12337 1977-09-07 1977-09-07 1977-09-07 1977-09-07   True
3       10610 1994-08-24 2009-01-09 2015-03-01 2015-03-01   True
4        7493 1993-08-22 2002-06-03 2015-02-16 2015-02-16   True
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • @AntonvBR - I think `diff` shoud be more slowier in my opinion, also it compare datetimes, not integers – jezrael Oct 18 '18 at 10:44
  • @AntonvBR - I get `%timeit (df.eval('A <= B <= C <= D')) 3.92 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) %timeit ((df.diff(axis=1) >= pd.Timedelta(0)).all(1)) 30.7 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)` – jezrael Oct 18 '18 at 10:48
  • @AntonvBR - so it is data related ? because it is huge difference and I cannot see reason for it. – jezrael Oct 18 '18 at 11:08
  • Removed comments because of overflow. Sorry but why would it be such a big difference? – Anton vBR Oct 18 '18 at 12:25
  • @AntonvBR - I have no idea, sure. I try test your data. – jezrael Oct 18 '18 at 12:34
  • @AntonvBR - working nice, but timeings return `TypeError: numpy boolean subtract, the `-` operator, is deprecated, use the bitwise_xor, the `^` operator, or the logical_xor function instead.` :( What is your pandas version? – jezrael Oct 18 '18 at 12:36
  • @AntonvBR - better timeings? tested under windows and python 3 ? – jezrael Oct 18 '18 at 12:54
  • i found problem, not remove `df['check'] = ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1)` from code, so it try compare with Tiemdelta boolenam column – jezrael Oct 18 '18 at 12:56
  • I get `%timeit df.eval('A <= B <= C <= D') 158 ms ± 5.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1) 843 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Oct 18 '18 at 12:57
  • and for 10k rows `%timeit df.eval('A <= B <= C <= D') 2.64 ms ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) %timeit ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1) 925 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)` – jezrael Oct 18 '18 at 12:58
  • So for datafarmes to 10k is faster your solution, for really big not – jezrael Oct 18 '18 at 12:59
  • Nice :). Anyway, in this case your syntax is short but the one I have can be more "general". Good job! – Anton vBR Oct 18 '18 at 13:03
1

Use df.diff(), similar to np.diff() and check if any element is lower than pd.Timedelta(0).

df['check'] = ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1)

Full example:

import pandas as pd
import numpy

np.random.seed(333)

# Random dates from:  https://stackoverflow.com/questions/50559078/
def pp(start, end, n):
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n)).view('M8[ns]'))

n = 10

df = pd.DataFrame({
    'A': pp(pd.Timestamp('2018'), pd.Timestamp('2019'), n),
    'B': pp(pd.Timestamp('2018'), pd.Timestamp('2019'), n),
    'C': pp(pd.Timestamp('2018'), pd.Timestamp('2019'), n),
    'D': pp(pd.Timestamp('2018'), pd.Timestamp('2019'), n)
})

df['check'] = ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1)
print(df)

Returns:

                    A                   B                   C  \
0 2018-07-30 04:54:04 2018-03-13 00:28:13 2018-08-24 11:01:29   
1 2018-12-26 21:22:20 2018-09-23 14:25:11 2018-08-19 07:21:59   
2 2018-04-29 17:15:57 2018-05-28 12:35:35 2018-10-16 00:19:11   
3 2018-12-11 06:56:35 2018-08-15 00:12:12 2018-08-05 23:47:08   
4 2018-03-04 11:00:03 2018-07-03 07:22:30 2018-09-09 01:45:09   
5 2018-08-22 03:24:30 2018-12-17 17:38:34 2018-01-29 13:02:29   
6 2018-04-21 01:10:14 2018-06-09 20:37:08 2018-04-30 12:30:00   
7 2018-06-27 18:40:46 2018-09-15 10:26:06 2018-05-13 03:51:36   
8 2018-03-18 06:31:24 2018-11-10 06:24:12 2018-02-25 02:58:15   
9 2018-11-08 17:52:19 2018-03-27 01:02:12 2018-03-06 00:10:02   

                    D  check  
0 2018-07-30 16:16:03  False  
1 2018-07-21 23:38:59  False  
2 2018-10-25 03:46:37   True  
3 2018-12-01 07:43:53  False  
4 2018-12-07 16:11:31   True  
5 2018-09-17 14:58:20  False  
6 2018-07-02 09:36:35  False  
7 2018-03-16 23:21:27  False  
8 2018-10-30 11:24:01  False  
9 2018-04-03 12:17:52  False

Benchmarking

%timeit ~df.diff(axis=1).lt(pd.Timedelta(0)).any(1)
%timeit df.eval('A <= B <= C <= D')    

10.000 rows:

#1000 loops, best of 3: 1.58 ms per loop
#100 loops, best of 3: 3.31 ms per loop

10.000.000 rows:

#1 loop, best of 3: 2.27 s per loop
#1 loop, best of 3: 243 ms per loop
Community
  • 1
  • 1
Anton vBR
  • 18,287
  • 5
  • 40
  • 46
0

There might be a smarter way, but this should work

df['Check'] = (df['A'] <= df['B']) & (df['B']<= df['C']) & (df['C'] <= df['D'])
ipramusinto
  • 2,310
  • 2
  • 14
  • 24