1

I have to two dataframes. The first one has the index ranging from 3rd of December 2006 to 20th of December 2006 for example. The second one has the dates ranging from 2000 to 2020. I want to drop the rows from the second one when their dates are not within the first one's interval.

Consider the example below:

The first one is this one:

index                           value
'2006-12-03 13:06:21.955000'    3
'2006-12-03 13:14:54.100000'    4
'2006-12-04 13:23:25.929000'    5
'2006-12-05 13:31:58.074000'    6
'2006-12-05 13:40:29.903000'    7
'2006-12-05 13:49:02.048000'    8
'2006-12-06 13:57:33.877000'    9
.
.
.
'2006-12-20 14:06:06.022000'    100
'2006-12-20 14:14:37.851000'    110

The second one is this one:

id  date                name
.
.
.
39  2005-08-22 17:27:00 O
40  2005-09-07 17:40:00 F
41  2006-12-05 10:35:00 X
42  2006-12-13 02:40:00 F
43  2010-08-14 10:05:00 F
44  2011-03-07 20:12:00 M
45  2011-06-07 08:03:00 U
46  2011-08-04 04:12:00 M
47  2011-08-09 08:05:00 P
48  2011-09-22 11:01:00 L
49  2011-11-26 07:10:00 N
50  2012-01-23 03:59:00 M
51  2012-01-27 18:37:00 X
.
.
.

The desired one is the second one but edited as follows:

41  2006-12-05 10:35:00 X
42  2006-12-13 02:40:00 F

--> to remain only with the dates that are also present in the first one.

I tried the following command in order to drop the rows based on the condition:

second_df = second_df[(second_df.date < date_start_first) | (second_df.date > date_end_first)]

(I inspired from this answer.)

Unfortunately, the above line of code does not work ...

The date_start_first and date_end_first are extracted as follows:

date_start_first = getStartEndDatesOfDataframe(first_df, "start")
date_end_first = getStartEndDatesOfDataframe(first_df, "end")

Using this function:

def getStartEndDatesOfDataframe(dataSeriesName, start_or_end):
   if (start_or_end == "start"):
       date = dataSeriesName.index[0]
   else:
       date = dataSeriesName.index[len(dataSeriesName.index)-1]
   return date

Could you help me please with this problem?

P.S.: both "dates" have the same type, I verifies using type() function:

print(type(second_df.date[3]), type(first_df.index[3]))

And it gives:

<class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Ariadne R.
  • 452
  • 11
  • 24

1 Answers1

1

Using string slicing together with apply and lambda:

import pandas as pd

df_1 = pd.DataFrame([
   ['2006-12-03 13:06:21.955000', 3],
   ['2006-12-03 13:14:54.100000', 4],
   ['2006-12-04 13:23:25.929000', 5],
   ['2006-12-05 13:31:58.074000', 6],
   ['2006-12-05 13:40:29.903000', 7],
   ['2006-12-05 13:49:02.048000', 8],
   ['2006-12-06 13:57:33.877000', 9]
], columns=["Date", "value"]
)

df_2 = pd.DataFrame([
    ["2005-08-22 17:27:00", "O"],
    ["2005-09-07 17:40:00", "F"],
    ["2006-12-05 10:35:00", "X"],
    ["2006-12-13 02:40:00", "F"],
    ["2010-08-14 10:05:00", "F"],
    ["2011-03-07 20:12:00", "M"],
    ["2011-06-07 08:03:00", "U"],
    ["2011-08-04 04:12:00", "M"],
    ["2011-08-09 08:05:00", "P"],
    ["2011-09-22 11:01:00", "L"],
    ["2011-11-26 07:10:00", "N"],
    ["2012-01-23 03:59:00", "M"],
    ["2012-01-27 18:37:00", "X"]
], columns=["Date", "name"]
)

df_1.set_index(["Date"], inplace=True)
dt = [d[:10] for d in df_1.index.values]
filt = df_2.Date.apply(lambda x: x[:10] in dt)
print(df_2[filt])

Produces:

                  Date name
2  2006-12-05 10:35:00    X
Gustav Rasmussen
  • 3,720
  • 4
  • 23
  • 53