
I have a pandas dataframe that looks like the following:

ID  date       close
1   09/15/07   123.45
2   06/01/08   130.13
3   10/25/08   132.01
4   05/13/09   118.34
5   11/07/09   145.99
6   11/15/09   146.73
7   07/03/11   171.10

I want to remove any rows that overlap.

An overlapping row is defined as any row within X days of another row. For example, if X = 365, then the result should be:

ID  date       close
1   09/15/07   123.45
3   10/25/08   132.01
5   11/07/09   145.99
7   07/03/11   171.10

If X = 50, the result should be:

ID  date       close
1   09/15/07   123.45
2   06/01/08   130.13
3   10/25/08   132.01
4   05/13/09   118.34
5   11/07/09   145.99
7   07/03/11   171.10

I've taken a look at a few questions on here but haven't found the right approach. For example, Pandas check for overlapping dates in multiple rows and Fastest way to eliminate specific dates from pandas dataframe are similar but don't quite get me what I need.

I have the following ugly code in place today that works for small X values, but when X gets larger (e.g., when X = 365), it removes all dates except the original date.

from datetime import timedelta

filter_dates = []
for index, row in df.iterrows():
    if observation_time == 'D':
        # Collect every date within observation_period days after this row
        for i in range(1, observation_period):
            filter_dates.append(index.date() + timedelta(days=i))
df = df[~df.index.isin(filter_dates)]

Any help/pointers would be appreciated!

Clarification:

The solution to this needs to look at every row, not just the first row.

Eric D. Brown D.Sc.

5 Answers


You can add a new column to filter the results:

df['filter'] = df['date'] - df['date'][0]
df['filter'] = df['filter'].apply(lambda x: x.days)

Then to filter by 365 use this:

df[df['filter']%365==0]
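For reference, the two steps above can be collapsed into one vectorized line using the .dt accessor; a minimal sketch, assuming the date column already holds datetimes:

df['filter'] = (df['date'] - df['date'].iloc[0]).dt.days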
zipa

I found yet another solution to this (you can review the edit history if you want to see the old ones). This is the best solution I've come up with. It still keeps the first sequential record, but it can be tweaked to keep the record that comes chronologically first (provided at the end).

days = 365  # the X from the question
target = df.iloc[0]  # Get the first item in the dataframe
day_diff = abs(target.date - df.date)  # Get the differences of all the other dates from the first item
day_diff = day_diff.reset_index().sort_values(['date', 'index'])  # Reset the index, then sort by date and original index to maintain the order of the dates
day_diff.columns = ['old_index', 'date']  # Rename the old index column because of a name clash
good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values  # Group the dates by range, then take the first item from each group
df.iloc[good_ids]

Once again, I performed some tests to compare it to QuickBeam's method. The DataFrames used were a randomly sorted one with 600,000 rows and one ordered by date with 73,000 rows:

My Method:

DataFrame             days           time
600k/random            2             1 loop, best of 3: 5.03 s per loop
ordered                2             1 loop, best of 3: 564 ms per loop


600k/random            50            1 loop, best of 3: 5.17 s per loop
ordered                50            1 loop, best of 3: 583 ms per loop


600k/random            365           1 loop, best of 3: 5.16 s per loop
ordered                365           1 loop, best of 3: 577 ms per loop

QuickBeam's Method:

DataFrame             days           time

600k/random            2             1 loop, best of 3: 52.8 s per loop
ordered                2             1 loop, best of 3: 4.89 s per loop


600k/random            50            1 loop, best of 3: 53 s per loop
ordered                50            1 loop, best of 3: 4.53 s per loop

600k/random            365           1 loop, best of 3: 53.7 s per loop
ordered                365           1 loop, best of 3: 4.49 s per loop

So yeah, maybe I'm a little competitive...

Exact functions used for testing:

def my_filter(df, days):
    target = df.iloc[0]
    day_diff = abs(target.date - df.date)
    day_diff = day_diff.reset_index().sort_values(['date', 'index'])
    day_diff.columns = ['old_index', 'date']
    good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values
    return df.iloc[good_ids]

def quickbeam_filter(df, days):
    filter_ids = [0]
    last_day = df.loc[0, "date"]
    for index, row in df[1:].iterrows():
        if (row["date"] - last_day).days > days:
            filter_ids.append(index)
            last_day = row["date"]
    return df.loc[filter_ids, :]
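
For reference, a quick usage sketch on the question's sample data (the frame construction mirrors Quickbeam2k1's answer below); with days=365 both functions keep rows 0, 2, 4 and 6, i.e. IDs 1, 3, 5 and 7:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(["2007-09-15", "2008-06-01", "2008-10-25",
                                           "2009-05-13", "2009-11-07", "2009-11-15",
                                           "2011-07-03"]),
                   'close': [123.45, 130.13, 132.01, 118.34, 145.99, 146.73, 171.10]})

print(my_filter(df, 365))         # rows 0, 2, 4, 6
print(quickbeam_filter(df, 365))  # same four rows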

In case you want to get all the dates that start in a certain range, which makes more sense to me, you could use this version, which keeps the chronologically first record in each bucket:

def my_filter(df, days):
    target = df.iloc[0]
    day_diff = abs(target.date - df.date)
    day_diff = day_diff.reset_index().sort_values('date')  # sort by date only, not by original position
    day_diff.columns = ['old_index', 'date']
    good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values
    return df.iloc[good_ids]
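
On the question's sample data this variant returns the same four rows (IDs 1, 3, 5 and 7): since the data is already sorted by date, each bucket's chronologically first record is also first in the original order. The two versions only differ when the frame is unsorted.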
Cory Madden
  • this totally depends on the data set: using `df = pd.DataFrame({'date':pd.date_range(start='01.01.1900', end='01.01.2100', freq='D')})` and `X=2`, my algorithm takes, on my machine, 7 seconds, whereas your algorithm takes 3 minutes and 16 s. – Quickbeam2k1 Aug 12 '17 at 20:53
  • @Quickbeam2k1 They're both good solutions. :) But I tried your code, and my method was still faster. 4.7 seconds vs 1.5 seconds. – Cory Madden Aug 12 '17 at 21:00
  • Which code do you mean? See my answer for the code I performed the benchmarks with. – Quickbeam2k1 Aug 12 '17 at 21:11
  • Yes, that's the code I was referring to. I meant that code for generating the DataFrame – Cory Madden Aug 12 '17 at 21:12
  • Can't reproduce your result for X=2. But I think we don't need to go on here. – Quickbeam2k1 Aug 12 '17 at 21:22
  • No, they're both good solutions. But I appreciate the push into discovering a better way to do it. That's why I hang out here. – Cory Madden Aug 12 '17 at 21:28
  • @Quickbeam2k1 I missed the part about `X = 2`. You're correct. Though, I'm afraid you'll find that I couldn't let it go. You can check my updated post for, what I believe, is the best solution. – Cory Madden Aug 13 '17 at 00:27
  • Does your solution produce the correct results? Essentially you are only comparing with the first date, don't you? – Quickbeam2k1 Aug 13 '17 at 07:10

My approach would be to first compute a distance matrix:

distM = np.array([[np.timedelta64(abs(x-y),'D').astype(int) for y in df.date] for x in df.date])

In your example this would look like this:

[[   0  260  406  606  784  792 1387]
 [ 260    0  146  346  524  532 1127]
 [ 406  146    0  200  378  386  981]
 [ 606  346  200    0  178  186  781]
 [ 784  524  378  178    0    8  603]
 [ 792  532  386  186    8    0  595]
 [1387 1127  981  781  603  595    0]]

Since we're iterating downwards, we only care about distances in the upper triangle, so we modify the array: np.triu zeroes out the lower triangle, and every entry of 365 or less (which includes those zeros and the diagonal) gets set to a large number M; in this case I'll use 10,000.

distM[np.triu(distM) <= 365] = 10000

Then take the argmin across each row of the new distance matrix to decide which rows of the dataframe to keep:

keep = np.unique(np.argmin(distM, axis=1))
df = df.iloc[keep, :]

and all together...

distM = np.array([[np.timedelta64(abs(x - y), 'D').astype(int) for y in df.date] for x in df.date])

distM[np.triu(distM) <= 365] = 10000

keep = np.unique(np.argmin(distM, axis=1))

df = df.iloc[keep, :]
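
On the question's sample data this keeps rows 0, 2, 4 and 6 (IDs 1, 3, 5 and 7), matching the expected X = 365 output; the trade-off is building a full n × n matrix.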
DJK

I just used an elementary approach (essentially a tweaked version of the OP's approach): no fancy numpy or pandas ops, but linear instead of quadratic complexity (compared to the distance matrix approach).
However (like Cory Madden), I assume that the data is sorted with respect to the date column. I hope it's correct:

DataFrame (I'm using the pandas index here):

import pandas as pd
df = pd.DataFrame({'date': ["2007-09-15","2008-06-01","2008-10-25",
                            "2009-05-13","2009-11-07", "2009-11-15", "2011-07-03"],
                   'close':[123.45, 130.13, 132.01, 118.34, 
                            145.99, 146.73, 171.10]})
df["date"]=pd.to_datetime(df["date"])

The following block of code can easily be wrapped in a function and computes the correct dataframe indexes for X = 365:

X = 365
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] - last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]

and the result:

print(df.loc[filter_ids,:])
    close       date
0  123.45 2007-09-15
2  132.01 2008-10-25
4  145.99 2009-11-07
6  171.10 2011-07-03

Note that the indexes are shifted by one relative to the IDs because the index starts from zero.


I just wanted to comment on linear versus quadratic complexity. My solution has linear time complexity, seeing every row of the data frame exactly once. Cory Madden's solution has quadratic complexity: in each iteration, every row of the data frame is accessed. However, if X (the day difference) is large, we might discard a huge part of the data set and only perform very few iterations.

To this end, one might want to consider the following worst-case data set for X = 2:

df = pd.DataFrame({'date':pd.date_range(start='01.01.1900', end='01.01.2100', freq='D')})

On my machine, the following code blocks yield:

%%timeit
X = 2
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] -last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]
1 loop, best of 3: 7.06 s per loop

and

day_diffs = abs(df.iloc[0].date - df.date).dt.days
i = 0
days = 2
idx = day_diffs.index[i]
good_ids = {idx}
while True:
    try:
        current_row = day_diffs[idx]
        day_diffs = day_diffs.iloc[1:]
        records_not_overlapping = (day_diffs - current_row) > days
        idx = records_not_overlapping[records_not_overlapping == True].index[0]
        good_ids.add(idx)
    except IndexError:
        break
1 loop, best of 3: 3min 16s per loop
Quickbeam2k1
  • This looks like it might be a winner! Let me run some tests with it. – Eric D. Brown D.Sc. Aug 10 '17 at 18:17
  • I think this is it, although I need to work on 'business days' rather than 'days'. Your logic is sound. – Eric D. Brown D.Sc. Aug 10 '17 at 19:01
  • I think [this](https://stackoverflow.com/questions/13019719/get-business-days-between-start-and-end-date-using-pandas), [this](https://stackoverflow.com/questions/30265711/python-pandas-numpy-direct-calculation-of-number-of-business-days-between) and [this](https://stackoverflow.com/questions/31841487/pandas-number-of-business-days-between-a-datetimeindex-and-a-timestamp) might help. You just need to count the business days between the two dates in the if condition. But the overall logic stays the same. – Quickbeam2k1 Aug 10 '17 at 19:40
  • +1 as I realize that my solution will consume too much memory over time, even when using another package to compute the matrix – DJK Aug 10 '17 at 19:47
  • I ended up taking this approach. I had some help from a pandas guru and got things to work by adding an if statement in the for loop `if np.busday_count(last_day, overlapping_df.loc[index, "date"]) < X: remove_ids.append(index)` – Eric D. Brown D.Sc. Aug 13 '17 at 14:34

For anyone looking for the answer that worked for me, here it is (building on @Quickbeam2k1's answer):

import numpy as np

X = 50  # or whatever value you want
remove_ids = []
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if np.busday_count(last_day, df.loc[index, "date"]) < X:
        remove_ids.append(index)
    else:
        last_day = df.loc[index, "date"]

df = df[~df.index.isin(remove_ids)]  # drop the overlapping rows
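
Note that np.busday_count counts business days (Monday through Friday by default), so here X is measured in business days rather than calendar days.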
Eric D. Brown D.Sc.