-1

I need to pass 2 dates manually to variables and I need to filter a pandas dataframe, where the date values are between the 2 date variables.

The issue I'm facing is, the column in the dataframe is of type 'datetime64[ns]' but the value contains only the date (for eg: yyyy-mm-dd. No time.) and to be able to filter based on the condition, it tried to pass a date manually to a variable by doing the following:

start = np.array(np.datetime64('2020-03-01'))
end = np.array(np.datetime64('2020-03-31'))

Then, I'm trying to filter the results by doing the following:

filter1 = df1['TRANSACTION_DT'] > start
filter2 = df1['TRANSACTION_DT'] < end

df1.where(filter1 & filter2, inplace=True)

However, when I'm doing this, I'm getting all nulls. How can I declare datetime64 variables and manually pass a date so that I can filter my dataframe.

Please help! Thank you!

PKV
  • 167
  • 3
  • 13
  • Does this answer your question? [How to select rows in a DataFrame between two values, in Python Pandas?](https://stackoverflow.com/questions/31617845/how-to-select-rows-in-a-dataframe-between-two-values-in-python-pandas) – RichieV Sep 09 '20 at 02:19
  • @RichieV Not exactly. I'm having an issue because I need to manually pass datetime64 value and then compare. The example you mentioned is comparing a value which is an integer. Though the logic is similar, I'm specifically having issue with the datetime64 type variable comparison. – PKV Sep 10 '20 at 21:42

3 Answers3

1

You could use simple clustering along with & clause

import pandas as pd
from datetime import datetime


df = pd.DataFrame({
    "date": [datetime(2020,1, 2), datetime(2019, 11, 2), datetime(2018, 1, 2)],
})
df = df[(df['date'] > '2019-10-10') & (df['date'] < '2019-12-01')]

df = pd.DataFrame({
    "date": ["2020-01-02", "2019-11-02", "2018-01-02"],
df = df[(df['date'] > '2019-10-10') & (df['date'] < '2019-12-01')]
  • here you can see that the filtering works seamlessly when the column is of `datetime.datetime` and also when it is a string – kshitij mathur Sep 09 '20 at 01:45
1

This does seem to work. However, the problem is the use of .where(). This replaces the values in the columns where the condition is not met by False. This is not a valid np.datetime64 value, so it gets replaced by NaT, the datetime equivalent of a nan value.

If you only want to look at those rows for which the filter is true, you should use:

df1.loc[filter1 & filter2]

instead.

1

Solution

You could try this:

df.loc[df.TRANSACTION_DT.between('2020-07-03', '2020-07-08', inclusive=False)]

In a nutshell, you could try any of the following methods. I personally find Method-2 as the most versatile one. In case you are only interested to find dates between two given dates, you could also use Method-1 (pd.Series.between()).

# import numpy as np
# import pandas as pd

# start = '2020-07-03'
# end = '2020-07-08'

# One line solution with pd.DataFrame.where()
# Method-1
df.where(df.TRANSACTION_DT.between(start, end, inclusive=False))
# Method-2
df.where((df.TRANSACTION_DT > start) & (df.TRANSACTION_DT < end))
# Method-3
df.where(np.logical_and((df.TRANSACTION_DT > start), (df.TRANSACTION_DT < end)))

For example, if you use Method-1, and only want to see which rows in the dataframe satisfy the filtering condition(s), you could do it as follows:

# Note: I used the DUMMY DATA created 
#       in the following section for df
df.loc[df.TRANSACTION_DT.between(start, end, inclusive=False)]

## Output:
#
#   TRANSACTION_DT  Method_1  Method-2
# 3     2020-07-04      True      True
# 4     2020-07-05      True      True
# 5     2020-07-06      True      True
# 6     2020-07-07      True      True

Example: With Dummy Data

import numpy as np
import pandas as pd

start = '2020-07-03'
end = '2020-07-08'

## Dummy Data
dates = pd.date_range(start='2020-07-01', end='2020-07-10', freq='D')
df = pd.DataFrame(dates, columns=['TRANSACTION_DT'])

## Apply Criterion
# Method-1
df['Method_1'] = df.TRANSACTION_DT.between(start, end, inclusive=False)
# Method-2
df['Method_2'] = (df.TRANSACTION_DT > start) & (df.TRANSACTION_DT < end)
print(df)

Output:

  TRANSACTION_DT  Method_1  Method_2
0     2020-07-01     False     False
1     2020-07-02     False     False
2     2020-07-03     False     False
3     2020-07-04      True      True
4     2020-07-05      True      True
5     2020-07-06      True      True
6     2020-07-07      True      True
7     2020-07-08     False     False
8     2020-07-09     False     False
9     2020-07-10     False     False
CypherX
  • 7,019
  • 3
  • 25
  • 37