1

Good morning.

My problem is the following: I have a pandas dataframe with a column named 'fecha' that stores date objects and a list of tuples that stores an initial datetime and a final datetime. Show examples below:

df = 
     fecha
0    2018-10-01
1    2019-01-12
2    2018-12-25

list_ranges = [(datetime.datetime(2018, 10, 1, 0, 0), 
datetime.datetime(2018, 10, 15, 0, 0)),
(datetime.datetime(2018, 10, 16, 0, 0),
datetime.datetime(2018, 10, 31, 0, 0)),
(datetime.datetime(2018, 11, 1, 0, 0), datetime.datetime(2018, 11, 15, 0, 0)),
(datetime.datetime(2018, 11, 16, 0, 0),
datetime.datetime(2018, 11, 30, 0, 0)),
(datetime.datetime(2018, 12, 1, 0, 0), datetime.datetime(2018, 12, 15, 0, 0)),
(datetime.datetime(2018, 12, 16, 0, 0),
datetime.datetime(2018, 12, 31, 0, 0)),
(datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2019, 1, 15, 0, 0))] 

and I want to get the position of the range in which each date is on the list. The result I'm looking for is:

df = 
     result
0    1
1    7
2    6

Currently, I'm doing this:

df.fecha = df.fecha.apply(lambda x: select_quincena(x, quincenas))

def select_quincena(fecha, quincenas):

    fecha = datetime.datetime.combine(fecha, datetime.datetime.min.time())
    num   = 0

    for e in quincenas:
        num += 1

        if fecha >= e[0] and fecha <= e[1]:

            return num

It's working fine, but I'm pretty sure that there is a better and easier way to do this.

Thanks you very much in advance.

jpp
  • 159,742
  • 34
  • 281
  • 339
Á. Garzón
  • 345
  • 2
  • 15
  • If the column is an actual datetime column, then you can simply index with a boolean series: `df[(df['fetcha'] >= initial) & (df['fetcha'] < final)]` – alkasm Jan 19 '19 at 11:22

1 Answers1

0

Use pd.to_datetime, not the datetime module

First convert your series to Pandas datetime:

df['fecha'] = pd.to_datetime(df['fecha'])

Use pd.cut to bin categoricals

This is a binning problem. You can use pd.cut for a vectorised solution:

import numpy as np
from operator import itemgetter

dates = pd.to_datetime([list_ranges[0][0], *map(itemgetter(1), list_ranges)]).values
labels = np.arange(1, len(dates))

df['result'] = pd.cut(df['fecha'].values, bins=dates, labels=labels, include_lowest=True)

print(df)

       fecha result
0 2018-10-01      1
1 2019-01-12      7
2 2018-12-25      6

Related: Filtering and creating a column based on the date column

Use np.searchsorted for integer labels

Also possible in this case is a binary search to find the required insertion points. This is efficiently implemented in np.searchsorted.

df['result'] = np.searchsorted(dates, df['fecha'].values, side='right')

Dealing with Cannot cast array data errors

The above is tested to work on Pandas 0.23.4 / NumPy 1.15.1. For other versions, type comparison may not be consistent. In this case, you can convert datetime arrays to integers by replacing each extraction of .values by .values.astype('datetime64[D]').astype(int).

Related: Cannot cast array data from dtype('M8-ns') to dtype('float64') according to the rule 'safe'.

jpp
  • 159,742
  • 34
  • 281
  • 339
  • Thanks, but when I use it I obtain the next error: TypeError: Cannot cast unfuc less input from dtype(' – Á. Garzón Jan 19 '19 at 11:46
  • @Á.Garzón, Does my update work? In other words, use `df['fecha'].values`. If that doesn't work use `.values.astype('datetime64[D]').astype(int)` everywhere you use `.values`. For what it's worth, the solution works on Pandas 0.23.4 / Python 3.7.0 / NumPy 1.15.1. – jpp Jan 19 '19 at 11:50
  • Sorry no, now the error is: TypeError: can't compare datetime.date to long – Á. Garzón Jan 19 '19 at 11:58
  • @Á.Garzón, See edit; don't use `datetime.date`, use `pd.to_datetime`. Then replace, if necessary, all instances of `.values` with `.values.astype('datetime64[D]').astype(int)`. – jpp Jan 19 '19 at 12:38
  • Yes, finally it works! Thanks! I just add your edit and use np.searchsorted. – Á. Garzón Jan 19 '19 at 12:58